\(\newcommand{\mathds}[1]{\mathrm{I\hspace{-0.7mm}#1}}\) \(\newcommand{\bm}[1]{\boldsymbol{#1}}\) \(\newcommand{\bms}[1]{\boldsymbol{\scriptsize #1}}\) \(\newcommand{\proper}[1]{\text{#1}}\) \(\newcommand{\pE}{\proper{E}}\) \(\newcommand{\pV}{\proper{Var}}\) \(\newcommand{\pCov}{\proper{Cov}}\) \(\newcommand{\pACF}{\proper{ACF}}\) \(\newcommand{\I}{\bm{\mathcal{I}}}\) \(\newcommand{\wh}[1]{\widehat{#1}}\) \(\newcommand{\wt}[1]{\widetilde{#1}}\) \(\newcommand{\pP}{\proper{P}}\) \(\newcommand{\pAIC}{\textsf{AIC}}\) \(\DeclareMathOperator{\diag}{diag}\)

1  Probability Theory for Data Scientists

1.1 Set Theory Concepts

NoteDefinition: Sample Space, Event, and Empty Set

Definition 1.1 Consider an uncertain scenario. This could be a random experiment, a data-generating process, or simply the future. We define the following concepts:

  • Sample Space (\(\Omega\)): The set of all possible outcomes or results from the scenario. Sample spaces can be either countable or uncountable. If the sample space contains only a finite number of elements, or if its elements can be put into one-to-one correspondence with the set of integers, the sample space is countable. Otherwise it is uncountable.

  • Event: A subset of the sample space. It represents a specific outcome or a collection of outcomes of interest.

  • Empty Set (\(\emptyset\)): A set containing no elements. It represents an impossible event.

Example 1.1 If we flip a coin twice then the sample space can be written as: \[ \Omega =\{HH,HT,TH,TT\} \]

where \(H\) represents heads and \(T\) tails. This sample space is finite. An event (say \(A\)) could be at least one head appears, that is

\[A =\{HT,TH,HH\}\subset \Omega\]
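The sample space and event of Example 1.1 can be written down directly in code. The following is a minimal sketch, assuming outcomes are encoded as two-character strings (a representation chosen here purely for illustration):

```python
from itertools import product

# Sample space for two coin flips: every ordered pair of H/T.
omega = {"".join(flips) for flips in product("HT", repeat=2)}

# Event A: at least one head appears.
A = {outcome for outcome in omega if "H" in outcome}

print(sorted(omega))   # ['HH', 'HT', 'TH', 'TT']
print(sorted(A))       # ['HH', 'HT', 'TH']
print(A <= omega)      # True: an event is a subset of the sample space
```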

Example 1.2 If we are analyzing customer purchase behavior for a single online transaction, the sample space could be the set of all possible combinations of items a customer might select from the store’s catalog. This sample space is in principle finite and therefore countable. However, if the catalog is very large, the sample space can be treated as uncountably large for practical purposes. More on this later.

An event could be “customer buys at least one item from category X”, or “customer buys product Y”.

Example 1.3 We measure the time (in seconds) it takes for a user to complete a task on a website. The time limit is predefined at 5 minutes. Then the sample space is \(\Omega = \{0, 1, 2, 3,\ldots, 300\}\) which is finite. If, however, we measure the time with arbitrary precision, then the sample space is the interval \((0,300)\) of real numbers. This sample space is uncountable.

An event could be “user completes the task in under 2 minutes”. In the former case, this corresponds to the set \(A=\{0,1,2,\ldots, 119 \}\). In the latter case, it is the real interval \(A=(0, 120)\).

Events can be described in many different ways. We will use set theory and notation to describe events and operations on events. This can help later in the computation of probabilities.

NoteBasic Set Operations

Definition 1.2 Given events \(A,B,C\) in the sample space \(\Omega\):

  • Union (\(A \cup B\)): The event that \(A\) occurs, or \(B\) occurs, or both occur.

  • Intersection (\(A \cap B\)): The event that both \(A\) and \(B\) occur.

  • Complement (\(A^c\)): The event that \(A\) does not occur. It is the set of all outcomes in \(\Omega\) that are not in \(A\).

The following properties hold for any events \(A, B, C\):

  • Commutativity:
    • Union: \(A \cup B = B \cup A\)
    • Intersection: \(A \cap B = B \cap A\)
  • Associativity:
    • Union: \((A \cup B) \cup C = A \cup (B \cup C)\)
    • Intersection: \((A \cap B) \cap C = A \cap (B \cap C)\)
  • Distributive Laws:
    • Intersection over Union: \(A \cap (B \cup C) = (A \cap B) \cup (A \cap C)\)
    • Union over Intersection: \(A \cup (B \cap C) = (A \cup B) \cap (A \cup C)\)
  • De Morgan’s Laws:
    • \((A \cup B)^c = A^c \cap B^c\)
    • \((A \cap B)^c = A^c \cup B^c\)
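
These identities can be spot-checked with Python's built-in `set` type. The sketch below uses the two-flip sample space and three events chosen for illustration; a check on one example is of course not a proof.

```python
omega = {"HH", "HT", "TH", "TT"}
A = {"HT", "HH"}          # head on first flip
B = {"TH", "HH"}          # head on second flip
C = {"HH", "TT"}          # both flips equal

def complement(event, sample_space=omega):
    """Outcomes of the sample space that are not in the event."""
    return sample_space - event

# Distributive laws
assert A & (B | C) == (A & B) | (A & C)
assert A | (B & C) == (A | B) & (A | C)

# De Morgan's laws
assert complement(A | B) == complement(A) & complement(B)
assert complement(A & B) == complement(A) | complement(B)

print(complement(A | B))  # {'TT'}
```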
NoteDisjoint Sets and Partitions of Sample Space

Definition 1.3  

  • Disjoint Sets (Mutually Exclusive Events): Two sets \(A\) and \(B\) are disjoint if they have no elements in common, i.e., \(A \cap B = \emptyset\).

  • Partition: A collection of non-empty, disjoint subsets (events) of \(\Omega\) whose union is \(\Omega\). That is \(A_1,A_2, \ldots\) is a partition if

\[ \bigcup_{i} A_i = \Omega \quad \text{and} \quad A_i \cap A_j = \emptyset \text{ for } i \ne j \]

NoteRepresentation of events using set operations

Example 1.4 When we flip a coin twice, the event \(A\) “at least one head appears” can be written in various ways. These include:

  • the union of three events \(A = \{HT\} \cup \{TH\} \cup \{HH\}\). That is, \(A\) occurs if we get heads on the first flip and tails on the second flip, or tails on the first flip and heads on the second flip, or heads on both flips. Note that these three events are disjoint as they do not share any outcomes.

  • the union \(A = A_1 \cup A_2\) where \(A_1 = \{HT, HH\}\) is the event “head on first flip” and \(A_2 = \{TH, HH\}\) is the event “head on second flip”. Note that \(A_1\) and \(A_2\) are not disjoint as they both contain the outcome \(HH\).

  • the complement \(A = B^c\) where \(B = \{TT\}\) is the event “no heads appear”.

Three different partitions of the sample space are given by:

  • The trivial partition where each event contains a single outcome: \[ \mathcal P_1=\{\{HT\},\{TH\},\{HH\},\{TT\}\} \]

  • The partition: \[ \mathcal P_{equal}=\{\{HH,TT\}, \{HT,TH\}\} \] that is, when we flip the coin twice, either we get the same results in both throws OR different ones.

  • The partition where we group the outcomes based on the number of heads:

\[ \mathcal P_{heads} =\{\{TT\}, \{HT,TH\}, \{HH\}\} \]

that is, when we flip the coin twice, we can get no heads, one head or two heads.
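
Whether a collection of events is a partition can be checked mechanically from Definition 1.3. The helper `is_partition` below is a hypothetical name introduced for this sketch; it tests that the blocks are non-empty, pairwise disjoint, and cover the sample space.

```python
from itertools import combinations

def is_partition(blocks, sample_space):
    """Return True if `blocks` is a partition of `sample_space`."""
    blocks = [set(b) for b in blocks]
    if any(len(b) == 0 for b in blocks):
        return False                      # blocks must be non-empty
    if any(a & b for a, b in combinations(blocks, 2)):
        return False                      # blocks must be pairwise disjoint
    return set().union(*blocks) == set(sample_space)  # blocks must cover Omega

omega = {"HH", "HT", "TH", "TT"}
p_equal = [{"HH", "TT"}, {"HT", "TH"}]
p_heads = [{"TT"}, {"HT", "TH"}, {"HH"}]
not_partition = [{"HT", "HH"}, {"TH", "HH"}]   # A_1 and A_2 overlap in HH

print(is_partition(p_equal, omega))        # True
print(is_partition(p_heads, omega))        # True
print(is_partition(not_partition, omega))  # False
```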

NoteSigma Algebra

Definition 1.4 A collection \(\mathcal{F}\) of subsets of \(\Omega\) is a sigma algebra (or \(\sigma\)-algebra) if it satisfies the following properties:

  1. \(\Omega \in \mathcal{F}\) (The sample space is in the collection).
  2. If \(A \in \mathcal{F}\), then \(A^c \in \mathcal{F}\) (The collection is closed under complementation).
  3. If \(A_1, A_2, \dots\) are in \(\mathcal{F}\), then \[ \bigcup_i A_i \in \mathcal{F} \]

that is, the collection is closed under countable unions.

Note that the definition of a sigma-algebra does not explicitly require that the intersection of two sets in \(\mathcal F\) is also in \(\mathcal F\). However, this property follows from the other properties and De Morgan’s laws. If \(A,B \in \mathcal F\), then

\[ A^c \in \mathcal F\,,B^c \in \mathcal F \implies A^c \cup B^c \in \mathcal F \implies (A^c \cup B^c)^c= A \cap B\in \mathcal F \]

Similarly, \(\emptyset = \Omega^c \in \mathcal F\), so every sigma-algebra contains the empty set.

NoteExamples of sigma-algebras

Example 1.5 The trivial sigma algebra is clearly \(\mathcal F_0=\{\emptyset, \Omega\}\) which does not seem very useful.

The partition \(\mathcal P_{equal}=\{\{HH,TT\}, \{HT,TH\}\}\) above is not a sigma-algebra, as it contains neither the empty set nor \(\Omega\). If we add the empty set, it is still not a sigma algebra, as it is not closed under unions: the union of its two non-empty elements is \(\Omega\). If we also include \(\Omega\), then we obtain the sigma algebra:

\[ \mathcal F_{equal}=\{\emptyset, \Omega, \{HH,TT\}, \{HT,TH\}\} \]

The partition \(\mathcal P_{heads}\) above is also not a sigma algebra, but if we add the empty set and all possible unions of its elements then we obtain the sigma algebra:

\[ \begin{aligned} \mathcal F_{heads}& =\{\emptyset, \Omega, \{TT\}, \{HT,TH\}, \{HH\}, \{HT,TH,HH\},\\ & \{HT,TH,TT\}, \{HH,TT\}\} \end{aligned} \]

The set \[ \mathcal G =\{\emptyset,\Omega,\{HT\},\{TH\},\{HH\},\{TT\}\} \]

obtained from \(\mathcal P_1\) is neither a partition nor a sigma algebra, as it is not closed under union. For example, \(\{HT\}\cup \{TH\}=\{HT,TH\}\notin \mathcal G\). However, if we add all possible unions of the elements of \(\mathcal G\) we obtain the power set of \(\Omega\), that is, the set of all subsets of \(\Omega\):

\[ \begin{aligned} \mathcal F_{max} &= \{\emptyset, \{HT\},\{TH\},\{HH\},\{TT\},\\ & \{HT,TH\}, \{HT,HH\}, \{HT,TT\}, \{TH,HH\}, \{TH,TT\}, \{HH,TT\},\\ & \{HT,TH,HH\}, \{HT,TH,TT\}, \{HT,HH,TT\}, \{TH,HH,TT\},\Omega \} \end{aligned} \]

This is the largest possible sigma-algebra for this sample space. It has \(2^4=16\) elements since the sample space has 4 elements. In general, if the sample space has \(n\) elements, then its power set has \(2^n\) elements.

Also generally, if we have a finite partition of \(\Omega\) then the collection of all unions of sets in the partition (including the empty set) is a sigma-algebra.

Note that different sigma algebras serve different purposes. For example, the sigma algebra \(\mathcal F_{equal}\) is useful if we are only interested in whether the two coin flips are the same or different. The sigma algebra \(\mathcal F_{heads}\) is useful if we are interested in the number of heads. The power set \(\mathcal F_{max}\) is a sigma algebra that may be more useful if we are interested in all possible events.

In this example we also observe that:

\[ \mathcal F_0 \subset \mathcal F_{equal} \subset \mathcal F_{heads} \subset \mathcal F_{max} \]

so that \(\mathcal F_0\) and \(\mathcal F_{max}\) are the smallest and largest sigma algebras possible for this sample space.
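
The construction "take all unions of the blocks of a finite partition" can be sketched in code. The function names below are hypothetical helpers introduced for this illustration; events are stored as `frozenset` objects so that they can themselves be members of a set.

```python
from itertools import combinations

def sigma_algebra_from_partition(partition):
    """All unions of blocks of a finite partition (the empty union is the empty set)."""
    blocks = [frozenset(b) for b in partition]
    events = set()
    for k in range(len(blocks) + 1):
        for chosen in combinations(blocks, k):
            events.add(frozenset().union(*chosen))
    return events

def is_sigma_algebra(events, sample_space):
    """Check the three defining properties on a finite collection of events."""
    omega = frozenset(sample_space)
    has_omega = omega in events
    closed_complement = all(omega - a in events for a in events)
    # For a finite collection, closure under pairwise unions is enough.
    closed_union = all(a | b in events for a in events for b in events)
    return has_omega and closed_complement and closed_union

omega = {"HH", "HT", "TH", "TT"}
f_equal = sigma_algebra_from_partition([{"HH", "TT"}, {"HT", "TH"}])
f_heads = sigma_algebra_from_partition([{"TT"}, {"HT", "TH"}, {"HH"}])
f_max = sigma_algebra_from_partition([{"HH"}, {"HT"}, {"TH"}, {"TT"}])

print(len(f_equal), len(f_heads), len(f_max))   # 4 8 16
print(f_equal < f_heads < f_max)                # True
print(is_sigma_algebra(f_heads, omega))         # True
```

A partition with \(m\) blocks generates a sigma-algebra with \(2^m\) events, which matches the sizes 4, 8 and 16 found above.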

1.2 Probability

We will start by defining probability in an intuitive way. Later we will give a more formal mathematical definition.

1.2.1 Types of Probability

There are several ways to think about probability. These include

  • Classical Probability: Assumes all possible outcomes in a finite sample space are equally likely. That is, for any event \(A\) with \(n(A)\) outcomes in a sample space \(\Omega\) with \(n(\Omega)\) equally likely outcomes, the probability of \(A\) is: \[ P(A) = \frac{n(A)}{n(\Omega)} \]

    Example 1.6 Under this framework, the probability of rolling an even number on a die is assigned to be \(P(\text{rolling an even number}) = \frac{3}{6}\). More generally, this is equivalent to saying the die is fair. Another example is when we assign the probability of rain tomorrow, locally at 10 AM, to be 1/2 as there are only two possible outcomes: rain or no rain.

  • Empirical (or Frequentist) Probability: Based on observed frequencies from repeated experiments. As the number \(N\) of experiment repetitions increases, the relative frequency of an event \(A\) approaches its true probability: \[ P(A) \approx \frac{\text{Number of times } A \text{ occurred}}{N} \]

    Example 1.7 Suppose we do not know the probability of heads when flipping a coin. We can flip the coin 1000 times and, if it lands heads 537 times, we would say the empirical probability of heads is \(0.537\). Furthermore, we might say that the true probability of heads is \(\approx 0.537\). The important aspect of this framework is that, in theory, the more times we flip the coin, the closer the empirical proportion will be to the true probability. As another example, suppose that according to historical data for our location it has rained on 33.6% of the days in the last 10 years. Then the empirical probability of rain tomorrow locally at 10 AM is 0.336.

  • Subjective Probability: Based on personal belief or judgment, often used when objective data is scarce.

    Example 1.8 I looked through the window and it is a bit overcast, so I believe the probability of rain tomorrow locally at 10 AM is 0.7. On the other hand, if I am a weather expert with a background in atmospheric physics, I might believe the probability of rain tomorrow locally at 10 AM is 0.9.
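
The frequentist idea in Example 1.7 can be illustrated by simulation. The sketch below assumes a true probability of heads of 0.5 (an assumption made only so that there is something to simulate; in a real experiment this value is unknown) and uses a seeded generator so that the run is reproducible.

```python
import random

def empirical_probability(p_heads, n_flips, seed=0):
    """Relative frequency of heads in n_flips simulated flips."""
    rng = random.Random(seed)             # seeded for reproducibility
    heads = sum(rng.random() < p_heads for _ in range(n_flips))
    return heads / n_flips

true_p = 0.5   # assumed for the simulation; unknown in a real experiment
for n in (10, 1_000, 100_000):
    print(n, empirical_probability(true_p, n))
```

With more flips the printed relative frequency typically settles closer to the assumed value, although any single run can deviate.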

1.2.2 Formal definition of probability

After we have chosen a sigma algebra \(\mathcal F\) that contains events we are interested in, we can define probabilities for all the events in a more formal way.

NoteProbability Measure (Kolmogorov’s Axioms)

Definition 1.5 A probability measure \(P\) on a sample space \(\Omega\) with a \(\sigma\)-algebra \(\mathcal{F}\) is a function \(P: \mathcal{F} \to [0, 1]\) that assigns a probability to each event in \(\mathcal{F}\) and satisfies the following three axioms:

  1. Non-negativity: For any event \(A \in \mathcal{F}\), \(P(A) \ge 0\). The probability of any event is non-negative.

  2. Normalization: \(P(\Omega) = 1\). The probability of the entire sample space (the certain event) is 1.

  3. Additivity (for disjoint events): If \(A_1, A_2, \dots, A_n\) are disjoint events in \(\mathcal{F}\) (i.e., \(A_i \cap A_j = \emptyset\) for \(i \ne j\)), then \[ P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i=1}^n P(A_i) \] For a countably infinite sequence of disjoint events, this extends to: \[ P\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty P(A_i) \] The probability of the union of disjoint events is the sum of their individual probabilities.

NoteProbability measure for the equality of two coin flips

Example 1.9 For the sigma algebra \(\mathcal F_{equal}=\{\emptyset, \Omega, \{HH,TT\}, \{HT,TH\}\}\) we can define a probability measure simply by specifying:

  • \(P(\emptyset) = 0\)
  • \(P(\{HH,TT\}) = 0.4\)

Note we can compute the probability of the other two events in \(\mathcal F_{equal}\) using the axioms:

  • \(P(\Omega) = 1\) (by axiom 2)

  • \(P(\{HT,TH\}) = P(\Omega) - P(\{HH,TT\}) = 1 - 0.4 = 0.6\) (by axiom 3)

The assignment of probability \(0.4\) to \(\{HH,TT\}\) may be frequentist or subjective but, regardless of this, it generates a valid probability measure as it satisfies all three axioms.

NoteProbability measure for the number of heads in two coin flips

Example 1.10 For the sigma algebra \(\mathcal F_{heads}\) we can define a probability measure simply by specifying:

  • \(P(\{HT,TH\}) = 0.5\)
  • \(P(\{TT\}) = 0.1\)

The probabilities for the rest of the events in \(\mathcal F_{heads}\) can be computed using axiom 3 as follows:

  • \(P(\{HH\}) = 1-0.1-0.5 = 0.4\)
  • \(P(\{HT,TH,HH\}) = P(\{HT,TH\}) + P(\{HH\}) = 0.5 + 0.4 = 0.9\)
  • \(P(\{HT,TH,TT\}) = P(\{HT,TH\}) + P(\{TT\})= 0.5 + 0.1 =0.6\)
  • \(P(\{HH,TT\}) = 0.1 +0.4 = 0.5\)
  • \(P(\Omega) = 1\) (Trivial but good to double check in practice)
  • \(P(\emptyset) = 1-1=0\) (Trivial, always true)

As before, the probability assignment may be frequentist or subjective but, regardless of this, it generates a valid probability measure as it satisfies all three axioms.

NoteProbability measure for power set

Example 1.11 For the largest sigma algebra \(\mathcal F_{max}\) we can define a probability measure simply by specifying probabilities for three of the four singletons (atoms); the fourth follows from normalization:

  • \(P(\{HH\}) = 0.3\)
  • \(P(\{HT\}) = 0.2\)
  • \(P(\{TH\}) = 0.4\)

The probabilities for the rest of the events in \(\mathcal F_{max}\) can be computed using the axioms as follows:

  • \(P(\{TT\}) = 1-0.3-0.2-0.4 = 0.1\)
  • \(P(\{HT,TH\}) = 0.2 + 0.4 = 0.6\)
  • \(P(\{HT,HH\}) = 0.2 + 0.3 = 0.5\)
  • \(P(\{HT,TT\}) = 0.2 + 0.1 = 0.3\)
  • \(P(\{TH,HH\}) = 0.4 + 0.3 = 0.7\)
  • \(P(\{TH,TT\}) = 0.4 + 0.1 = 0.5\)
  • \(P(\{HH,TT\}) = 0.3 + 0.1 = 0.4\)
  • \(P(\{HT,TH,HH\}) = 0.2 + 0.4 + 0.3 = 0.9\)
  • \(P(\{HT,TH,TT\}) = 0.2 + 0.4 + 0.1 = 0.7\)
  • \(P(\{HT,HH,TT\}) = 0.2 + 0.3 + 0.1 = 0.6\)
  • \(P(\{TH,HH,TT\}) = 0.4 + 0.3 + 0.1 = 0.8\)

As before, the probability assignment may be frequentist or subjective but, regardless of this, it generates a valid probability measure as it satisfies all three axioms.
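
On a finite sample space, the whole measure of Example 1.11 is determined by the atom probabilities. The sketch below stores them as exact fractions (a choice made here to avoid floating-point rounding) and checks the three axioms on every event of the power set; the names `atoms` and `prob` are introduced for this illustration.

```python
from fractions import Fraction
from itertools import combinations

# Atom probabilities from Example 1.11, stored as exact fractions.
atoms = {
    "HH": Fraction(3, 10),
    "HT": Fraction(2, 10),
    "TH": Fraction(4, 10),
}
atoms["TT"] = 1 - sum(atoms.values())      # normalization fixes the last atom

def prob(event):
    """P(event) as the sum of the atom probabilities (finite additivity)."""
    return sum((atoms[w] for w in event), Fraction(0))

omega = frozenset(atoms)
power_set = [frozenset(c) for k in range(len(omega) + 1)
             for c in combinations(sorted(omega), k)]

# Check the axioms on every event of the power set.
assert all(prob(a) >= 0 for a in power_set)
assert prob(omega) == 1
assert all(prob(a | b) == prob(a) + prob(b)
           for a in power_set for b in power_set if not (a & b))

print(float(prob({"HT", "TH", "HH"})))   # 0.9
print(float(prob({"HH", "TT"})))         # 0.4
```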

NoteSimple Probability Operations

Proposition 1.1 From the axioms, we can derive several useful properties:

  • Probability of the Complement: For any event \(A \in \mathcal{F}\), \[ P(A^c) = 1 - P(A) \]

  • Probability of the empty set: \(P(\emptyset) = 0\).

  • Probability of the Union of Two Events (General): For any two events \(A, B \in \mathcal{F}\): \[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \] This is known as the addition rule. It accounts for the overlap between events.

NoteProbability of the union

Example 1.12 Using the probability measure of Example 1.10: \[ \begin{aligned} P(\{HT,TH,TT\}\cup\{HH,TT\})&=P(\{HT,TH,TT\})+P(\{HH,TT\})-P(\{TT\})\\ & = 0.6+0.5-0.1\\ & =1 \end{aligned} \]

This is clearly correct, as the union of these two events is \(\Omega\).

NoteBoole and Bonferroni inequalities

Theorem 1.1  

  • Boole’s Inequality: For any events \(A_1, A_2, \ldots, A_n\) in \(\mathcal F\): \[ P\left(\bigcup_{i=1}^n A_i\right) \le \sum_{i=1}^n P(A_i) \] This inequality provides an upper bound for the probability of the union of events.

  • Bonferroni’s Inequality: For any events \(A_1, A_2, \ldots, A_n\) in \(\mathcal F\): \[ P\left(\bigcap_{i=1}^n A_i\right) \ge 1 - \sum_{i=1}^n P(A_i^c) \] This inequality provides a lower bound for the probability of the intersection of events.

These inequalities, especially Bonferroni’s, will be useful later. Boole’s inequality can be proved by induction, and Bonferroni’s inequality follows from Boole’s inequality applied to the complements \(A_i^c\) together with De Morgan’s laws. These facts can be verified by the reader.
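
Both bounds can be checked numerically on the two-flip example. The sketch below reuses the atom probabilities of Example 1.11 and three events picked for illustration. Note that for this particular choice the Bonferroni bound is negative and therefore uninformative; the bound is most useful when the complements \(A_i^c\) have small probabilities.

```python
from fractions import Fraction

# Atom probabilities from Example 1.11.
atoms = {"HH": Fraction(3, 10), "HT": Fraction(2, 10),
         "TH": Fraction(4, 10), "TT": Fraction(1, 10)}
omega = set(atoms)

def prob(event):
    """P(event) as the sum of its atom probabilities."""
    return sum((atoms[w] for w in event), Fraction(0))

events = [{"HT", "HH"}, {"TH", "HH"}, {"HH", "TT"}]   # illustrative choice

union = set().union(*events)
intersection = set(omega).intersection(*events)

boole_bound = sum(prob(a) for a in events)
bonferroni_bound = 1 - sum(prob(omega - a) for a in events)

print(prob(union), "<=", boole_bound)              # 1 <= 8/5
print(prob(intersection), ">=", bonferroni_bound)  # 3/10 >= -2/5
```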

1.3 Conditional Probability

NoteConditional Probability

Definition 1.6 The conditional probability of event \(A\) given that event \(B\) has occurred, denoted \(P(A|B)\), is defined as: \[ P(A|B) = \frac{P(A \cap B)}{P(B)} \] provided that \(P(B) > 0\). This measures the probability of event \(A\) occurring, knowing that event \(B\) has already happened.

NoteExample of Conditional Probability

Example 1.13 What is the probability of getting heads on the first coin flip GIVEN that at least one head appears in two flips? This can be expressed as \(P(A|B)\) where \(A=\{HT,HH\}\) is “head on first flip” and \(B=\{HT,TH,HH\}\) is “at least one head appears”. We have:

\[ P(A|B) = \frac{P(A \cap B)}{P(B)} =\frac{P(A)}{P(B)}= \frac{P(\{HT,HH\})}{P(\{HT,TH,HH\})} \]

since \(A\subset B\) in this case. We notice a subtlety here. The event \(A =\{HT,HH\}\) (head on the first flip) is not a member of the sigma-algebra \(\mathcal F_{heads}\), so we cannot use the probability measure \(P\) from Example 1.10 to compute this conditional probability. However, it is a member of the power set \(\mathcal F_{max}\), so we can use the probability measure defined on that sigma algebra (see Example 1.11) as follows:

\[ P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(\{HT,HH\})}{P(\{HT,TH,HH\})} = \frac{0.5}{0.9} \approx 0.556 \]
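
The same computation can be sketched in code. The helper `cond_prob` is a hypothetical name; it implements Definition 1.6 directly on top of the atom probabilities of Example 1.11 and refuses to condition on an event of probability zero.

```python
from fractions import Fraction

# Atom probabilities from Example 1.11.
atoms = {"HH": Fraction(3, 10), "HT": Fraction(2, 10),
         "TH": Fraction(4, 10), "TT": Fraction(1, 10)}

def prob(event):
    """P(event) as the sum of its atom probabilities."""
    return sum((atoms[w] for w in event), Fraction(0))

def cond_prob(a, b):
    """P(a | b) = P(a & b) / P(b); undefined when P(b) = 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(set(a) & set(b)) / p_b

A = {"HT", "HH"}          # head on first flip
B = {"HT", "TH", "HH"}    # at least one head
result = cond_prob(A, B)
print(result, round(float(result), 3))   # 5/9 0.556
```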

In a more practical setting, if \(A\) is “a user makes a purchase” and \(B\) is “a user clicks on an advertisement”, then \(P(A|B)\) is the probability that a user makes a purchase GIVEN that they clicked on the advertisement. This is a key metric for evaluating ad campaign effectiveness.

Two very useful consequences of the above are the law of total probability, which combines the notion of partition with that of conditional probability, and Bayes’ rule, which allows us to reverse conditional probabilities.

NoteLaw of Total Probability

Proposition 1.2 Let \(A_1, A_2, \dots\) be a partition of the sample space \(\Omega\) (recall Definition 1.3). Then for any event \(B \in \mathcal{F}\): \[ P(B) = \sum_{i} P(B|A_i) P(A_i) \]

NoteBayes’ Rule

Proposition 1.3 For events \(A\) and \(B\) where \(P(A) > 0\) and \(P(B) > 0\): \[ P(A|B) = \frac{P(B|A) P(A)}{P(B)} \] If \(A_1, \dots, A_n\) form a partition of \(\Omega\), and \(P(A_i) > 0\) for all \(i\), then for each \(j\) Bayes’ Rule can be written using the Law of Total Probability for the denominator: \[ P(A_j|B) = \frac{P(B|A_j) P(A_j)}{\sum_{i=1}^n P(B|A_i) P(A_i)} \]

The proof of this result is straightforward and left to the reader.
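
For readers who want to check their work, one possible argument is sketched here, assuming \(P(A) > 0\) and \(P(B) > 0\). Applying Definition 1.6 twice gives

\[ \begin{aligned} P(A|B) &= \frac{P(A \cap B)}{P(B)} \\ &= \frac{P(B|A)\,P(A)}{P(B)} \qquad \text{since } P(B|A) = \frac{P(A\cap B)}{P(A)} \end{aligned} \]

For the partition form, the events \(B \cap A_1, \dots, B \cap A_n\) are disjoint and their union is \(B\), so by axiom 3 and Definition 1.6

\[ P(B) = \sum_{i=1}^n P(B \cap A_i) = \sum_{i=1}^n P(B|A_i)\,P(A_i) \]

which is Proposition 1.2. Substituting this expression for the denominator \(P(B)\) gives the second formula.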

NoteExample of Law of Total Probability and Bayes’ Rule

Example 1.14 (Medical Testing) Suppose a rare disease affects 1 in 10,000 people. A test for this disease is 99% accurate:

  • If a person has the disease, the test correctly identifies it 99% of the time (True Positive).
  • If a person does not have the disease, the test correctly identifies it 99% of the time (True Negative).

Let \(D\) be the event that a person has the disease, and \(T^+\) be the event that the test is positive. The probabilities we know are:

  • \(P(D) = \frac{1}{10000} = 0.0001\) (Prevalence)
  • \(P(T^+|D) = 0.99\) (Sensitivity - True Positive Rate)
  • \(P(T^-|D^c) = 0.99\) (Specificity - True Negative Rate)

Before we proceed, we note that the probability specifications above are empirical.

Suppose we want to find \(P(D|T^+)\), the probability that a person actually has the disease given a positive test result.

First, we need \(P(T^+)\). A positive test can occur in two ways:

  • (\(D \cap T^+\)) or

  • (\(D^c \cap T^+\))

that is, these two events form a partition of \(T^+\). We also have:

  • \(P(T^+|D^c) = 1 - P(T^-|D^c) = 1 - 0.99 = 0.01\) (False Positive Rate)
  • \(P(D^c) = 1 - P(D) = 1 - 0.0001 = 0.9999\)

Using the law of total probability:

\[ \begin{aligned} P(T^+) &=P(T^+ \cap D)+P(T^+\cap D^c)\\ & =P(T^+|D)P(D) + P(T^+|D^c)P(D^c)\\ &= (0.99)(0.0001) + (0.01)(0.9999)\\ &= 0.000099 + 0.009999 = 0.010098 \end{aligned} \]

Now, using Bayes’ Theorem: \[ \begin{aligned} P(D|T^+) &= \frac{P(T^+|D) P(D)}{P(T^+)} \\ &= \frac{(0.99)(0.0001)}{0.010098} \approx 0.0098 \end{aligned} \]

This may look counter-intuitive. Even with a positive test, there’s only about a 0.98% (less than 1%) chance the person actually has the disease! The reason is that the disease is rare, so the false positives among the many healthy people far outnumber the true positives among the few who are ill. This highlights the importance of understanding base rates and conditional probabilities in interpreting results.

Let’s now code the previous example in Python. We write a function that returns \(P(D|T^+)\) given the prevalence of the disease, the sensitivity and the specificity of the test. We vary the prevalence to see how it affects the result.

Code
def bayes_medical_test(prevalence, sensitivity, specificity):
    """Return P(D|T+) from the prevalence, sensitivity and specificity."""
    P_D = prevalence  # Prevalence of the disease
    P_T_given_D = sensitivity  # Sensitivity (True Positive Rate)
    P_T_given_not_D = 1 - specificity  # False Positive Rate
    P_not_D = 1 - P_D  # Probability of not having the disease
    # Calculate P(T+)
    P_T = (P_T_given_D * P_D) + (P_T_given_not_D * P_not_D)
    # Calculate P(D|T+) using Bayes' Theorem
    P_D_given_T = (P_T_given_D * P_D) / P_T
    return P_D_given_T
# Example usage
prevalence = 1 / 10000  # 1 in 10,000
sensitivity = 0.99  # 99% sensitivity
specificity = 0.99  # 99% specificity
result_10k = bayes_medical_test(prevalence, sensitivity, specificity)
print(f"P(D|T+) = {result_10k:.4f}")

prevalence = 1 / 1000  # 1 in 1,000

result_1k = bayes_medical_test(prevalence, sensitivity, specificity)
print(f"P(D|T+) = {result_1k:.4f}")

prevalence = 1 / 100  # 1 in 100

result_100 = bayes_medical_test(prevalence, sensitivity, specificity)
print(f"P(D|T+) = {result_100:.4f}")

P(D|T+) = 0.0098
P(D|T+) = 0.0902
P(D|T+) = 0.5000

We can see how the prevalence of the disease affects the probability \(P(D|T^+)\) significantly. As the disease becomes more common, the probability that a person actually has the disease given a positive test result increases.

The Law of Total Probability allows us to calculate the probability of an event \(A\) by considering the different ways it can occur through the events in a partition.

NoteCustomer churn

Example 1.15 Suppose we have three models, \(M_1\), \(M_2\), and \(M_3\), that are used to predict customer churn. Let \(P(M_1)=0.5\), \(P(M_2)=0.3\), \(P(M_3)=0.2\) be the probabilities that each model is the “best” for a given customer. Let \(A\) be the event “customer churns”. If we know the probability of churn given each best model (e.g., \(P(A|M_1)=0.1\), \(P(A|M_2)=0.2\), \(P(A|M_3)=0.15\)), the Law of Total Probability allows us to find the overall probability of churn:

\[ \begin{aligned} P(A) &= P(A|M_1)P(M_1) + P(A|M_2)P(M_2) + P(A|M_3)P(M_3) \\ &= (0.1)(0.5) + (0.2)(0.3) + (0.15)(0.2)\\ & = 0.05 + 0.06 + 0.03 = 0.14 \end{aligned} \]
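
The churn calculation can be reproduced in a few lines. As an extension not carried out in the example itself, the sketch also applies Bayes’ rule to obtain \(P(M_i|A)\), the probability that each model is the “best” one given that the customer churned. The dictionary names are introduced for this illustration.

```python
priors = {"M1": 0.5, "M2": 0.3, "M3": 0.2}               # P(M_i)
churn_given_model = {"M1": 0.1, "M2": 0.2, "M3": 0.15}   # P(A | M_i)

# Law of Total Probability: P(A) = sum_i P(A | M_i) P(M_i)
p_churn = sum(churn_given_model[m] * priors[m] for m in priors)

# Bayes' rule: P(M_i | A) = P(A | M_i) P(M_i) / P(A)
posteriors = {m: churn_given_model[m] * priors[m] / p_churn for m in priors}

print(round(p_churn, 4))   # 0.14
for model, p in posteriors.items():
    print(model, round(p, 4))
```

The posterior probabilities sum to one, as they must, since \(M_1, M_2, M_3\) form a partition.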

1.4 Independence

1.4.1 Independence of events

Intuitively, two events \(A\) and \(B\) are considered independent if the occurrence of one event does not affect the probability of the other event occurring. Formally,

NoteIndependent Events

Definition 1.7 Events \(A\) and \(B\) are independent if: \[ P(A \cap B) = P(A) \times P(B) \] or equivalently, when \(P(A) > 0\) and \(P(B) > 0\), if either of the following conditions holds:

  • \(P(A|B) = P(A)\)

  • \(P(B|A) = P(B)\)

This means that knowing that event \(B\) has occurred gives us no new information about the probability of event \(A\) occurring, and vice versa.

NoteIndependent event when flipping a coin twice

Consider the following events when flipping a coin twice:

  • \(A=\{HT, HH\}\) the first flip is heads
  • \(B=\{TT, HH\}\) the two flips are the same

Then using the probabilities in Example 1.11 we have:

\[ P(A\cap B) = P(\{HH\}) = 0.3 \neq P(\{HT,HH\})P(\{TT,HH\}) = 0.5\times 0.4 = 0.2 \]

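Therefore these two events are not independent. Of course, the assignment of probabilities plays a role here. For instance, had we assigned \(P(\{HH\})=0.2\), \(P(\{HT\})=0.3\), \(P(\{TH\})=0.3\) and \(P(\{TT\})=0.2\), then \(P(A)=0.5\), \(P(B)=0.4\) and \(P(A\cap B)=0.2=P(A)P(B)\), so the events would have been independent.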
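
Checking independence amounts to comparing \(P(A\cap B)\) with \(P(A)P(B)\). The sketch below does this for the probabilities of Example 1.11 and, as an assumed alternative for comparison, for a fair coin where each of the four outcomes has probability 1/4. Exact fractions are used so that the equality test is not affected by rounding.

```python
from fractions import Fraction

def prob(event, atoms):
    """P(event) as the sum of its atom probabilities."""
    return sum((atoms[w] for w in event), Fraction(0))

def are_independent(a, b, atoms):
    """True if P(a & b) == P(a) P(b) under the given atom probabilities."""
    return prob(a & b, atoms) == prob(a, atoms) * prob(b, atoms)

A = {"HT", "HH"}   # first flip is heads
B = {"TT", "HH"}   # the two flips are the same

example_1_11 = {"HH": Fraction(3, 10), "HT": Fraction(2, 10),
                "TH": Fraction(4, 10), "TT": Fraction(1, 10)}
fair_coin = {w: Fraction(1, 4) for w in ("HH", "HT", "TH", "TT")}  # assumed alternative

print(are_independent(A, B, example_1_11))  # False
print(are_independent(A, B, fair_coin))     # True
```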

An obvious consequence of Definition 1.6 of conditional probability is the so-called multiplication rule.

NoteMultiplication Rule

Proposition 1.4 For any two events \(A\) and \(B\) with \(P(A) > 0\) and \(P(B) > 0\): \[ P(A \cap B) = P(A) \times P(B|A) = P(B) \times P(A|B) \]

NoteDrawing Cards without Replacement

Example 1.16 Imagine drawing two cards from a standard deck without replacement. Let \(A\) be the event that the first card is a Heart. \(P(A) = \frac{13}{52}\). Let \(B\) be the event that the second card is a Heart. Since the first card is not replaced, these events are dependent. The probability of the second card being a Heart depends on the first card drawn. \(P(B|A)\) (the probability the second card is a Heart, given the first was a Heart) is \(\frac{12}{51}\) (as there are 12 Hearts left and 51 total cards). So, the probability of drawing two Hearts in a row is \(P(A \cap B) = P(A) \times P(B|A) = \frac{13}{52} \times \frac{12}{51}\).
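
The product in Example 1.16 simplifies to \(1/17\). The sketch below computes it with the multiplication rule and cross-checks the result by enumerating all ordered pairs of distinct cards, assuming every ordered pair is equally likely; the deck encoding is chosen here for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Multiplication rule: P(A and B) = P(A) * P(B | A)
p_two_hearts = Fraction(13, 52) * Fraction(12, 51)

# Cross-check by enumerating all ordered pairs of distinct cards.
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for suit in suits for rank in range(1, 14)]
pairs = list(permutations(deck, 2))
favourable = sum(first[1] == "hearts" and second[1] == "hearts"
                 for first, second in pairs)

print(p_two_hearts)                       # 1/17
print(Fraction(favourable, len(pairs)))   # 1/17
```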

Note

The outcome of flipping a coin may be independent of the outcome of any previous coin flips. If you flip a coin and get heads, the probability of getting heads on the next flip should remain as before. Of course, this is a simplifying assumption that may not hold in practice. In this course we will make this kind of assumption, especially when it involves sequences of events. Not assuming independence for sequences of events makes things more complicated than necessary for what we want to achieve in this course.

NoteIndependence of many events

Definition 1.8 A collection of events \(A_1, A_2, \ldots, A_n\) is mutually independent if for every sub-collection \(\{A_{i_1}, A_{i_2}, \ldots, A_{i_k}\}\) of size \(k\), with \(2 \le k \le n\), we have that:

\[ P(A_{i_1} \cap A_{i_2} \cap \ldots \cap A_{i_k}) = P(A_{i_1}) \times P(A_{i_2}) \times \ldots \times P(A_{i_k}) \]

For example, three events \(A\), \(B\), and \(C\) are mutually independent if all the following conditions hold:

  • \(P(A \cap B) = P(A)P(B)\)
  • \(P(A \cap C) = P(A)P(C)\)
  • \(P(B \cap C) = P(B)P(C)\)
  • \(P(A \cap B \cap C) = P(A)P(B)P(C)\)
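
The fourth condition is not implied by the first three. A standard illustration, sketched below under the assumption of a fair coin (each of the four outcomes has probability 1/4), uses the events "head on first flip", "head on second flip" and "both flips equal": every pair is independent, yet the three events together are not. The function `mutually_independent` is a hypothetical helper that applies Definition 1.8 literally.

```python
from fractions import Fraction
from itertools import combinations
from math import prod

atoms = {w: Fraction(1, 4) for w in ("HH", "HT", "TH", "TT")}  # fair coin, assumed

def prob(event):
    """P(event) as the sum of its atom probabilities."""
    return sum((atoms[w] for w in event), Fraction(0))

def mutually_independent(events):
    """Check the product rule for every sub-collection of size >= 2."""
    for k in range(2, len(events) + 1):
        for subset in combinations(events, k):
            joint = set.intersection(*subset)
            if prob(joint) != prod(prob(e) for e in subset):
                return False
    return True

A = {"HT", "HH"}   # head on first flip
B = {"TH", "HH"}   # head on second flip
C = {"HH", "TT"}   # both flips equal

pairwise = all(mutually_independent(list(pair))
               for pair in combinations([A, B, C], 2))
print(pairwise)                          # True
print(mutually_independent([A, B, C]))   # False
```

Here \(P(A\cap B\cap C)=P(\{HH\})=1/4\), whereas \(P(A)P(B)P(C)=1/8\).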

1.5 Random Variables

So far we have talked about events, which are subsets of the sample space. In many applications, especially in data science, we are interested in quantifying outcomes numerically. This is where random variables come into play.

NoteRandom Variable

Definition 1.9 A random variable \(X\) is a function that maps outcomes from the sample space \(\Omega\) to real numbers. That is, \(X: \Omega \to \mathbb{R}\). It quantifies the outcomes of a random phenomenon numerically.

The set of all possible values that \(X\) can take is called the image or range of the random variable, denoted as \(X(\Omega)\).

NoteRandom variable examples

Example 1.17 If \(\Omega\) is the set of all possible customer orders, a random variable \(X\) could be “the total dollar amount spent in an order”. For each order (an outcome in \(\Omega\)), \(X\) assigns a specific monetary value. As another example: for a user’s session on a website, \(X\) could be “the number of pages visited” or the “overall time spent on the website”.

Note a random variable is a function defined on the sample space \(\Omega\) rather than on a sigma algebra, so for each outcome \(\omega \in \Omega\), there is a corresponding real number \(X(\omega)\). However, events in a sigma algebra can be defined in terms of random variables. For example, the event “the total amount spent in an order is greater than 50 dollars” can be expressed as \(\{X > 50\}\).

NoteRandom variable: Number of equal coin flips

Example 1.18 When we flip a coin twice, the sample space is \(\Omega = \{TT, HT, TH, HH\}\). We can define a very simple random variable \(X\) as the “number of times the flips are the same”. The mapping would be:

  • \(X(TT) = 1\) (both flips are the same)
  • \(X(HT) = 0\) (flips are different)
  • \(X(TH) = 0\) (flips are different)
  • \(X(HH) = 1\) (both flips are the same)

The possible values of \(X\) are \(X(\Omega)=\{0, 1\}\).

NoteRandom variable: Number of heads in two coin flips

Example 1.19 When we flip a coin twice, the sample space is \(\Omega = \{TT, HT, TH, HH\}\). We can define a random variable \(X\) as the “number of heads” in the two flips. The mapping would be:

  • \(X(TT) = 0\) (no heads)
  • \(X(HT) = 1\) (one head)
  • \(X(TH) = 1\) (one head)
  • \(X(HH) = 2\) (two heads)

The possible values of \(X\) are \(\{0, 1, 2\}\). This random variable quantifies the outcome of the coin flips in terms of the number of heads observed. Also note the order in which the heads appear does not matter for this random variable.
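
The random variables of Examples 1.18 and 1.19 are ordinary functions on outcomes, so they can be written as Python functions. The sketch below also computes \(P(X=x)\) by adding up the probabilities of the outcomes mapped to each value; it assumes the atom probabilities of Example 1.11, and the helper name `pmf` is introduced for this illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Atom probabilities from Example 1.11 (an assumption for this sketch).
atoms = {"HH": Fraction(3, 10), "HT": Fraction(2, 10),
         "TH": Fraction(4, 10), "TT": Fraction(1, 10)}

def number_of_heads(outcome):
    """Random variable of Example 1.19."""
    return outcome.count("H")

def flips_equal(outcome):
    """Random variable of Example 1.18: 1 if both flips agree, else 0."""
    return int(outcome[0] == outcome[1])

def pmf(random_variable):
    """P(X = x) = P({w : X(w) = x}) for every value x in the image of X."""
    masses = defaultdict(Fraction)
    for outcome, p in atoms.items():
        masses[random_variable(outcome)] += p
    return dict(masses)

for x, p in sorted(pmf(number_of_heads).items()):
    print(x, p)          # 0 1/10, then 1 3/5, then 2 3/10
for x, p in sorted(pmf(flips_equal).items()):
    print(x, p)          # 0 3/5, then 1 2/5
```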

The above two random variables are discrete random variables as they take on a finite or countable number of values, that is \(X(\Omega)\) is finite or countable. There are also continuous random variables that can take on any value in a continuous range.

NoteContinuous vs Discrete Random Variables

Example 1.20 Going back to Example 1.3, we have already defined a random variable \(X\) as the time to complete a task on a website with a limit of 5 minutes. If we round to the nearest second, then the possible values of \(X\) are \(\{0,1, 2, 3, \ldots, 300\}\) and \(X\) is a discrete random variable. However, if we do not round then \(X\) can take any value in the interval \((0,300)\) and \(X\) is a continuous random variable.

The definition of continuous random variables requires a bit more than simply having an uncountably infinite image set \(X(\Omega)\). The definition is a bit technical as it involves the notion of probability density function.

NoteDiscrete and continuous random variables

Definition 1.10 We say a random variable \(X\) is

  • discrete if it takes on a finite or countably infinite number of distinct values, that is, if the image set \(X(\Omega)\) is either finite or countably infinite. The function \[ f_X(x) = P(X=x):=P(\{\omega\,:\,X(\omega)=x\})\quad \mbox{for } x\in X(\Omega) \] is called the probability mass function (PMF) of the discrete random variable \(X\). The PMF satisfies:
    • \(f_X(x) \ge 0\) for all \(x \in X(\Omega)\).
    • \(\sum_{x \in X(\Omega)} f_X(x) = 1\).

  • continuous if there exists a function \(f_X(x)\) such that for any two numbers \(a\) and \(b\) with \(a < b\): \[ P(a \le X \le b):=P(\{\omega\,:\, a \leq X(\omega)\leq b\}) = \int_a^b f_X(x) dx \] where
    • \(f_X(x) \ge 0\) for all \(x\) and
    • \(\int_{-\infty}^{\infty} f_X(x) dx = 1\).

    The function \(f_X(x)\) is called the probability density function (PDF) of the random variable \(X\).

The idea is that there are no point masses, that is, no individual real numbers that occur with positive probability. A continuous random variable takes any exact prescribed value with probability zero, that is \(P(X=x)=0\) for all \(x\), but there is a positive probability that its value will lie in particular intervals, which can be arbitrarily small.

NoteCumulative Distribution Function (CDF)

Definition 1.11 The cumulative distribution function (CDF) of a random variable \(X\), denoted by \(F_X(x)\), is the function \(F_X: \mathbb R \to [0,1]\) defined by \[ F_X(x) = P(X \le x):=P(\{\omega\,:\,X(\omega)\leq x\}) \]

for any real number \(x\). The CDF gives the probability that the random variable \(X\) takes on a value less than or equal to \(x\).

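We note that, for a continuous random variable,

\[ \begin{aligned} P(a < X \leq b) &= F_X(b) - F_X(a)\\ \rule{0in}{4ex} &= \int_a^b f_X(x) dx \end{aligned} \]

(the endpoints can be included or excluded because \(P(X=a)=P(X=b)=0\)), so that

\[ f_X(x) = \frac{d}{dx} F_X(x) \]

at every point \(x\) where \(F_X\) is differentiable.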

NoteCDF of a discrete random variable

Example 1.21 Consider a discrete random variable \(X\) with possible values in the set \(\{0, 1, 2, 3\}\). Assume the probability mass function (PMF) is given by: \[ f_X(x) = P(X=x) = \begin{cases} 0.1 & \text{if } x = 0 \\ 0.3 & \text{if } x = 1 \\ 0.4 & \text{if } x = 2 \\ 0.2 & \text{if } x = 3 \\ 0 & \text{otherwise} \end{cases} \]

then the CDF is given by

\[ F_X(x) = P(X \le x) = \begin{cases} 0 & \text{if } x < 0 \\ 0.1 & \text{if } 0 \le x < 1 \\ 0.4 & \text{if } 1 \le x < 2 \\ 0.8 & \text{if } 2 \le x < 3 \\ 1.0 & \text{if } x \ge 3 \end{cases} \]

We can plot the CDF in Python as follows:

Code
import numpy as np
import matplotlib.pyplot as plt

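# Evaluate the step-function CDF on a fine grid
x = np.linspace(-1, 4, 1000)
y = np.piecewise(x, [x < 0, (x >= 0) & (x < 1), (x >= 1) & (x < 2), (x >= 2) & (x < 3), x >= 3], [0, 0.1, 0.4, 0.8, 1.0])
plt.step(x, y, where='post')
# Emphasize the continuity from the right: filled circles mark the value attained
# at each jump, open circles mark the left limit (zorder keeps markers above the line)
plt.scatter([0, 1, 2, 3], [0.1, 0.4, 0.8, 1.0], color='blue', zorder=3)  # filled circles
plt.scatter([0, 1, 2, 3], [0, 0.1, 0.4, 0.8], color='white', edgecolor='blue', zorder=3)  # open circles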
plt.title('CDF of Discrete Random Variable')
plt.xlabel('x')
plt.ylabel('F(x)')
plt.grid()
plt.yticks(np.array([0,0.1, 0.4, 0.8, 1.0]))
plt.show()
Figure 1.1: CDF of a discrete random variable. Note the function is defined over all real numbers

Figure 1.1 shows that the CDF is a step function with jumps at the points where the random variable takes values, and that it is continuous from the right.
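The step-function CDF of Example 1.21 can also be computed directly from the PMF by accumulating probabilities. The following is a minimal sketch using only the Python standard library; the helper name `discrete_cdf` is ours and is chosen purely for illustration.

```python
from bisect import bisect_right
from itertools import accumulate

# PMF of Example 1.21: support points and their probabilities
support = [0, 1, 2, 3]
pmf = [0.1, 0.3, 0.4, 0.2]

# Running totals give the CDF value attained at each support point
cumulative = list(accumulate(pmf))

def discrete_cdf(x):
    """Return F_X(x) = P(X <= x) for the PMF above (illustrative helper)."""
    # Number of support points that are less than or equal to x
    n_le = bisect_right(support, x)
    return 0.0 if n_le == 0 else cumulative[n_le - 1]

for x in [-0.5, 0, 0.99, 1, 2.5, 3, 10]:
    print(f"F({x}) = {discrete_cdf(x):.1f}")
```

Because `bisect_right` counts the support points that are less than or equal to `x`, the function jumps exactly at the support points and is constant in between, which is the right-continuity seen in the figure.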

NoteProperties of Cumulative Distribution Functions

For any random variable \(X\), its CDF \(F_X(x)\) has the following properties:

  1. Monotonicity: \(F_X(x)\) is non-decreasing. For any \(x_1 < x_2\), \(F_X(x_1) \le F_X(x_2)\).

  2. Limits: \(\lim_{x \to -\infty} F_X(x) = 0\) and \(\lim_{x \to \infty} F_X(x) = 1\).

  3. Right-continuity: \(F_X(x)\) is right-continuous, meaning \(\lim_{h \to 0^+} F_X(x+h) = F_X(x)\) for all \(x\).

The properties above hold for both discrete and continuous random variables. For discrete random variables, the CDF is a step function (continuous from the right), while for continuous random variables, the CDF is a continuous function.

NoteCDF of a continuous random variable

Example 1.22 Consider a random variable \(X\) representing the time (in hours) a server remains operational before crashing. \(X\) can take any non-negative real value. Assume the probability density function (pdf) is given by: \[ f_X(x)= \begin{cases} \frac{1}{100} e^{-x/100} & \text{if } x \ge 0 \\ \rule{0in}{3ex}0 & \text{if } x < 0 \end{cases} \]

This probability distribution is called an exponential distribution with a mean of 100 hours. We will define and talk about mean later. The CDF is computed as follows:

\[ F_X(x) = P(X \le x) = \int_{-\infty}^x f_X(t) dt = \begin{cases} 0 & \text{if } x < 0 \\ 1 - e^{-x/100} & \text{if } x \ge 0 \end{cases} \]

The CDF \(F_X(x) = P(X \le x)\) gives the probability that the server operates for at most \(x\) hours. For instance, \(F_X(10) = 1 - e^{-10/100} \approx 0.095\) is the probability that the server fails within the first 10 hours.
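As a sanity check, the probability that the server fails within the first 10 hours can be evaluated in two ways: from the closed-form CDF and by numerically integrating the PDF. This is only a sketch; the function names and the midpoint-rule integrator are our own illustrative choices.

```python
import math

def exp_pdf(x, mean=100.0):
    """Exponential density with the given mean (in hours)."""
    return math.exp(-x / mean) / mean if x >= 0 else 0.0

def exp_cdf(x, mean=100.0):
    """Closed-form CDF from Example 1.22."""
    return 1.0 - math.exp(-x / mean) if x >= 0 else 0.0

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

p_closed = exp_cdf(10)                 # 1 - exp(-0.1)
p_numeric = integrate(exp_pdf, 0, 10)  # area under the PDF between 0 and 10
print(round(p_closed, 4), round(p_numeric, 4))
```

Both numbers should agree to many decimal places, which illustrates that the CDF is the accumulated area under the PDF.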

We can plot the CDF and the PDF in Python as follows:

Code
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the PDF and CDF of the exponential distribution with mean 100 hours
x = np.linspace(-50, 400, 1000)
pdf = np.piecewise(x, [x < 0, x >= 0], [0, lambda x: (1/100) * np.exp(-x/100)])
cdf = np.piecewise(x, [x < 0, x >= 0], [0, lambda x: 1 - np.exp(-x/100)])

# Two stacked panels sharing the x-axis: PDF on top, CDF below
fig, (ax_pdf, ax_cdf) = plt.subplots(2, 1, sharex=True)
ax_pdf.plot(x, pdf, label='PDF', color='blue')
ax_pdf.set_title('PDF')
ax_pdf.set_ylabel('f(x)')
ax_pdf.grid()
ax_cdf.plot(x, cdf, label='CDF', color='orange')
ax_cdf.set_title('CDF')
ax_cdf.set_xlabel('x')
ax_cdf.set_ylabel('F(x)')
ax_cdf.grid()
plt.show()
Figure 1.2: PDF and CDF of an Exponential Random Variable
NoteSupport of a Random Variable

Definition 1.12 The support of a random variable is the set of values where its probability distribution is non-zero.

  • For a discrete random variable, the support is the set of values \(x\) such that \(f_X(x) > 0\).

  • For a continuous random variable, the support is the set of values \(x\) where the PDF \(f_X(x) > 0\).

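In terms of the sample space \(\Omega\), the support of a discrete random variable equals the image set \(X(\Omega)\) whenever every value in \(X(\Omega)\) has positive probability.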

1.6 Common Probability Distributions

1.6.1 Discrete distributions

Some common discrete probability distributions include:

NoteDiscrete Uniform Random Variable

Definition 1.13 Let \(X\) be a random variable that can take on any of \(k\) equally likely values. The PMF is given by: \[ f_X(x|k) = \begin{cases} \frac{1}{k} & \text{if } x \in \{1, 2, \ldots, k\} \\ 0 & \text{otherwise} \end{cases} \]

We only need to specify \(k\) to specify the distribution, hence the notation \(f_X(x|k)\) for the PMF.

Note the choice of support is arbitrary. We could have chosen any \(k\) distinct values for the support; we choose the first \(k\) positive integers for simplicity and convenience.

Examples of random variables modelled this way are the outcome of rolling a fair six-sided die (\(k=6\)) or the outcome of randomly selecting one item from a set of \(k\) distinct items. The notion of classical probability in Section 1.2.1 is equivalent to assigning a discrete uniform distribution to the random variable that assigns a real number to each outcome.

NoteGeneral discrete random variable

Definition 1.14 Let \(X\) be a discrete random variable with possible values in the set \(\{x_1, x_2, \ldots,x_k\}\). The PMF is given by: \[ f_X(x|p_1,\ldots,p_{k-1}) = P(X=x) = \begin{cases} p_i & \text{if } x = x_i \\ 0 & \text{otherwise} \end{cases} \]

where \(p_i > 0\) for all \(i\) and \(\sum_i p_i = 1\).

We only need to specify \(k-1\) probabilities as the last one is determined by the fact that the probabilities must sum to 1. Hence the notation \(f_X(x|p_1,\ldots,p_{k-1})\) for the PMF.

NoteBernoulli and Binomial Random Variables

Definition 1.15 Bernoulli Distribution: Models a single binary outcome (success/failure) with parameter \(p\) (probability of success). The PMF is given by: \[ f_X(x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases} \]

We only need to specify \(p\) to specify the distribution. We could have chosen any two distinct values instead of 0 and 1; we choose 0 and 1 for simplicity and convenience.

Binomial Distribution: If \(n\) identical Bernoulli trials are performed, define the events:

  • \(A_i\): the \(i\)-th trial is a success (for \(i = 1, 2, \ldots, n\)) with \(P(A_i) = p\).

If we assume the events \(A_1,\ldots, A_n\) are mutually independent (as in Definition 1.8), then the random variable \(Y\) defined as the number of successes in the \(n\) trials follows a Binomial distribution.

The event \(\{Y = y\}\) will occur only if, out of the events \(A_1,\ldots, A_n\), exactly \(y\) of them occur, and necessarily \(n - y\) of them do not occur. For example, when \(y=2\), one particular outcome (one particular ordering of occurrences and nonoccurrences) of the \(n\) Bernoulli trials might be:

\[A_1, A_2, A_3^c, A_4^c, \ldots, A_n^c\]

which has probability

\[p^2(1-p)^{n-2}\]

However, there are many such orderings that lead to the same event \(\{Y = 2\}\), for example: \[A_2, A_5, A_1^c, A_3^c, A_4^c, \ldots, A_n^c\]

which also has probability \(p^2(1-p)^{n-2}\). The number of such orderings is the number of ways of choosing 2 successes from \(n\) trials, which is given by the binomial coefficient \(\binom{n}{2}\). In general, the number of ways of choosing \(y\) successes from \(n\) trials is given by the binomial coefficient \(\binom{n}{y}\). Therefore, the PMF of the Binomial distribution is given by: \[ f_Y(y|n,p) = P(Y=y) = \binom{n}{y} p^y (1 - p)^{n-y} \quad \text{for } y = 0, 1, \ldots, n \]

We only need to specify \(n\) and \(p\) to specify the distribution, hence the notation \(f_Y(y|n,p)\) for the PMF.
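The counting argument above can be checked numerically. The sketch below compares the closed-form PMF with a brute-force sum over all orderings of successes and failures; the function names and the values \(n=5\), \(p=0.3\) are illustrative assumptions, not part of the text.

```python
from itertools import product
from math import comb, isclose

def binomial_pmf(y, n, p):
    """P(Y = y) for Y ~ Binomial(n, p), using the closed-form expression."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

def binomial_pmf_by_enumeration(y, n, p):
    """Add up the probabilities of every ordering of n trials with exactly y successes."""
    total = 0.0
    for trials in product([0, 1], repeat=n):  # all 2**n orderings
        if sum(trials) == y:
            # every ordering with y successes has the same probability
            total += p**y * (1 - p)**(n - y)
    return total

n, p = 5, 0.3  # illustrative values
for y in range(n + 1):
    closed = binomial_pmf(y, n, p)
    brute = binomial_pmf_by_enumeration(y, n, p)
    print(y, round(closed, 5), isclose(closed, brute))

total_probability = sum(binomial_pmf(y, n, p) for y in range(n + 1))
print(round(total_probability, 12))
```

The enumeration is only feasible for small \(n\) because it visits all \(2^n\) orderings; the binomial coefficient replaces that loop with a single count.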

NoteHypergeometric Random Variable

Definition 1.16 The hypergeometric distribution models the number of successes in a fixed number of draws from a finite population without replacement. It is similar to the Binomial distribution but without the assumption of independence of the Bernoulli trials.

It is easy to describe this distribution with a concrete example. Suppose we have an urn with:

  • a total of \(n\) balls

  • \(m\) balls are red and

  • \(n-m\) are green.

We select \(k\) balls at random (the \(k\) balls are taken all at once, a case of sampling without replacement) so that \(k\leq n\). What is the probability that exactly \(y\) of the balls are red?

The corresponding random variable \(Y\) is the number of red balls in the sample of size \(k\).

The support of the random variable \(Y\) can be obtained using the following reasoning:

  • To obtain the minimum number of red balls, there are two cases:

    • if there are at least as many green balls as balls we choose (\(k\leq \textcolor{green}{n-m}\)) and we happen to choose only green balls, then the number of red balls is 0, i.e. \(\textcolor{red}{Y}=0\), and this is the smallest it can be.

    • if \(k>\textcolor{green}{n-m}\) and we choose all the green balls, then the remaining \(k-\textcolor{green}{(n-m)}\) balls are necessarily red, so \(Y=k-\textcolor{green}{(n-m)}>0\), and this is the smallest it can be.

Therefore, the minimum number of red balls is \(\max\{0, k-\textcolor{green}{(n-m)}\}\).

  • For the maximum number of red balls, there are two cases:

    • if \(\textcolor{red}{m}\leq k\), we cannot choose more red balls than there are in the urn, so the maximum number of red balls is \(\textcolor{red}{m}\).

    • if \(k<\textcolor{red}{m}\), we cannot choose more red balls than the total number \(k\) of balls we are choosing, so the maximum number of red balls is \(k\).

Therefore, the maximum number of red balls is \(\min\{\textcolor{red}{m},k\}\).

We also have that:

  • the number of ways of choosing \(k\) balls from \(n\) is \(\binom{n}{k}\)

  • the number of ways of choosing \(y\) red balls from the \(\textcolor{red}{m}\) red balls is \(\binom{\textcolor{red}{m}}{y}\) and

  • the number of ways of choosing the remaining \(k-y\) balls from the \(\textcolor{green}{n-m}\) green balls is \(\binom{\textcolor{green}{n-m}}{k-y}\).

Therefore, the PMF of the Hypergeometric distribution is given by: \[ f_Y(\textcolor{red}{y}|n,\textcolor{red}{m},k) = P(\textcolor{red}{Y}=\textcolor{red}{y}) = \begin{cases} \frac{\displaystyle\binom{\textcolor{red}{m}}{\textcolor{red}{y}} \binom{\textcolor{green}{n-m}}{k-\textcolor{red}{y}}}{\displaystyle\binom{n}{k}} & \text{for } y = \max\{0, k-\textcolor{green}{(n-m)}\}, \ldots, \min\{\textcolor{red}{m},k\} \\ \rule{0in}{5ex} 0 & \text{otherwise} \end{cases} \]

The implicit assumption that all samples of \(k\) balls are equally likely is justified if we can guarantee that the balls are chosen at random.

We only need to specify \(n\), \(m\) and \(k\) to specify the distribution, hence the notation \(f_Y(y|n,m,k)\) for the PMF.
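The support and the PMF derived above translate directly into code. The following sketch uses exact rational arithmetic so that the probabilities can be checked to sum to exactly 1; the function names and the urn with 7 red and 3 green balls are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def hypergeom_support(n, m, k):
    """Possible numbers of red balls: max(0, k-(n-m)), ..., min(m, k)."""
    return range(max(0, k - (n - m)), min(m, k) + 1)

def hypergeom_pmf(y, n, m, k):
    """P(Y = y) when k balls are drawn from n balls, m of which are red."""
    if y not in hypergeom_support(n, m, k):
        return Fraction(0)
    return Fraction(comb(m, y) * comb(n - m, k - y), comb(n, k))

n, m, k = 10, 7, 5  # illustrative urn: 7 red, 3 green, draw 5 balls
support = list(hypergeom_support(n, m, k))
total = sum(hypergeom_pmf(y, n, m, k) for y in support)
print(support)
print(total)
```

With only 3 green balls and 5 draws, at least 2 of the drawn balls must be red, so the support starts at 2 rather than 0.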

NotePoisson Random Variable

Definition 1.17 This random variable is relevant when we are modeling the occurrences of an event in time or space. For example

  • the number of buses arriving at a stop in a given period of time,

  • the number of customers arriving at a bank in a given period of time,

  • the number of damaged trees in a given area of a forest,

  • the number of defects in a given length of communications cable.

The number of occurrences in a given interval or area can sometimes be modeled by the Poisson distribution.

The Poisson distribution is based on the following assumptions:

  • For small time intervals or small areas, the probability of an event is proportional to the length of the interval or the size of the area.

  • The numbers of events in disjoint time intervals or disjoint areas are independent.

  • The intensity \(\lambda\) (average rate of occurrence) is constant over time or space.

Let \(X\) be the random variable that models the number of events occurring in a fixed interval of time or area of space, given an intensity parameter \(\lambda > 0\) (which is relative to the length of the interval or the size of the area). The PMF is given by: \[ f_X(x|\lambda) = \frac{\lambda^x e^{-\lambda}}{x!} \quad \text{for }x = 0, 1, 2, \ldots \]

We only need to specify the rate \(\lambda\) to specify the distribution, hence the notation \(f_X(x|\lambda)\) for the PMF.
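A short sketch of the Poisson PMF follows, assuming an illustrative intensity \(\lambda = 3.5\) (not a value from the text). It checks numerically that the probabilities add up to 1 once enough terms are included.

```python
from math import exp, factorial, isclose

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam), with x = 0, 1, 2, ..."""
    return lam**x * exp(-lam) / factorial(x)

lam = 3.5  # illustrative intensity
probs = [poisson_pmf(x, lam) for x in range(60)]  # the tail beyond 60 is negligible here
total = sum(probs)
print(round(total, 12))
print(isclose(poisson_pmf(0, lam), exp(-lam)))  # P(no events) = e^{-lambda}
```

The support is infinite, so the sum is truncated; for a much larger \(\lambda\) more terms would be needed before the total is close to 1.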

NoteNegative Binomial Random Variable

Definition 1.18 The negative binomial distribution models the number of trials needed to achieve a fixed number of successes in a sequence of independent Bernoulli trials, each with the same probability of success \(p\).

Let \(X\) be the random variable representing the number of trials needed to achieve \(r\) successes. The PMF is given by: \[ f_X(x|r,p) = \binom{x-1}{r-1} p^r (1 - p)^{x-r} \quad \text{for } x = r, r+1, r+2, \ldots \]

This is because the \(r\)-th success must occur on the \(x\)-th trial, and the previous \(x-1\) trials must contain exactly \(r-1\) successes. The probability of any such sequence of trials is \(p^r (1 - p)^{x-r}\). The number of ways to choose which \(r-1\) trials out of the first \(x-1\) are successes is given by the binomial coefficient \(\binom{x-1}{r-1}\).

We only need to specify \(r\) and \(p\) to specify the distribution, hence the notation \(f_X(x|r,p)\) for the PMF.

The specific case where \(r=1\) is called the geometric distribution, which models the number of trials until the first success. The PMF for the geometric distribution is given by: \[ f_X(x|p) = (1 - p)^{x-1} p \quad \text{for } x = 1, 2, 3, \ldots \]

There are alternative definitions of the Negative Binomial distribution. For example, the random variable \(Y\) is defined as the number of failures before the \(r\)-th success. The PMF in this case is given by: \[ f_Y(y|r,p) = \binom{y+r-1}{r-1} p^r (1 - p)^{y}= \binom{y+r-1}{y} p^r (1 - p)^{y}\quad \text{for } y = 0, 1, 2, \ldots \]

Clearly, the two random variables are related in that \(Y=X-r\).
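The relation \(Y = X - r\) means that the two PMFs must agree after shifting the argument by \(r\). The sketch below checks this numerically, along with the geometric special case \(r=1\); the function names and parameter values are illustrative assumptions.

```python
from math import comb, isclose

def negbin_trials_pmf(x, r, p):
    """P(X = x): the r-th success occurs on trial x (x = r, r+1, ...)."""
    if x < r:
        return 0.0
    return comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

def negbin_failures_pmf(y, r, p):
    """P(Y = y): exactly y failures occur before the r-th success (y = 0, 1, ...)."""
    if y < 0:
        return 0.0
    return comb(y + r - 1, y) * p**r * (1 - p)**y

r, p = 3, 0.4  # illustrative values
shift_ok = all(
    isclose(negbin_trials_pmf(x, r, p), negbin_failures_pmf(x - r, r, p))
    for x in range(r, r + 50)
)
geometric_ok = all(
    isclose(negbin_trials_pmf(x, 1, p), (1 - p)**(x - 1) * p)
    for x in range(1, 50)
)
total = sum(negbin_trials_pmf(x, r, p) for x in range(r, 400))
print(shift_ok, geometric_ok, round(total, 12))
```

When using a statistical library, it is worth checking which of the two conventions (number of trials or number of failures) it follows, since both are called negative binomial.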

1.6.2 Continuous distributions

Some common continuous probability distributions include:

NoteContinuous Uniform

Definition 1.19 Models a continuous random variable such that intervals of the same length are equally likely. The support is the interval \((a, b)\) for \(a<b\). The probability density function (PDF) is given by: \[ f_X(x|a,b) = \begin{cases} \frac{1}{b - a} & \text{if } a < x < b \\ 0 & \text{otherwise} \end{cases} \]

We only need to specify \(a\) and \(b\) to specify the distribution. The CDF for this distribution is given by: \[ F_X(x|a,b) = \begin{cases} 0 & \text{if } x < a \\ \frac{x - a}{b - a} & \text{if } a \le x \le b \\ 1 & \text{if } x > b \end{cases} \]

NoteNormal (Gaussian) Random Variable

Definition 1.20 Models a continuous random variable with a bell-shaped curve, characterized by its mean \(\mu\) and standard deviation \(\sigma\). The PDF is given by: \[ f_X(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \quad \text{for } x \in \mathbb{R} \] The CDF for this distribution is given by: \[ F_X(x) = \int_{-\infty}^x f_X(u)\,du = \frac{1}{2} \left[ 1 + \text{erf}\left( \frac{x - \mu}{\sigma \sqrt{2}} \right) \right] \] where \(\text{erf}\) is the error function. Usually, we will express this CDF in terms of the CDF of the standard normal distribution (mean 0 and standard deviation 1) that we will denote by \(\Phi(x)\). Then we can write: \[ F_X(x) = \Phi\left( \frac{x - \mu}{\sigma} \right) \] where \(\Phi(x)\) is the CDF of the standard normal distribution, that is

\[ \Phi(z)=\int_{-\infty}^z \frac{1}{\sqrt{2\pi}} e^{-\frac{u^2}{2}} du \]
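The standardization \(F_X(x) = \Phi\left(\frac{x-\mu}{\sigma}\right)\) is easy to express with the error function from the Python standard library. In this sketch the function names and the values \(\mu = 100\), \(\sigma = 15\) are illustrative assumptions.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Phi(z), the CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """CDF of a normal random variable, obtained by standardizing x."""
    return std_normal_cdf((x - mu) / sigma)

mu, sigma = 100.0, 15.0  # illustrative parameters
at_mean = normal_cdf(mu, mu, sigma)
within_one_sd = normal_cdf(mu + sigma, mu, sigma) - normal_cdf(mu - sigma, mu, sigma)
print(at_mean)
print(round(within_one_sd, 4))
```

The probability of falling within one standard deviation of the mean does not depend on \(\mu\) or \(\sigma\), because standardizing maps the interval to \((-1, 1)\) in every case.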

NoteExponential Random Variable

Definition 1.21 This random variable can be used to model the time (measured continuously, i.e. with infinite precision) until the occurrence of an event of interest, or the time between events of interest. It is fully characterised by the rate parameter \(\lambda > 0\). The PDF is given by: \[ f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} \]

The CDF for this distribution is given by: \[ F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - e^{-\lambda x} & \text{if } x \ge 0 \end{cases} \]

1.6.3 Joint Distributions

We can define more than one random variable on the same sample space.

TipRandom Vectors

Definition 1.22 A random vector is a vector whose components are random variables defined on the same probability space. If we have \(k\) random variables \(X_1, X_2, \ldots, X_k\) defined on the same sample space \(\Omega\), we can define a random vector \(\mathbf{X}\) as the function from \(\Omega\) to \(\mathbb{R}^k\) given by:

\[ \mathbf{X}(\omega) := (X_1(\omega), X_2(\omega), \ldots, X_k(\omega)) \]

NoteJoint Distribution of two random variables

Definition 1.23 The joint distribution of two random variables \(X\) and \(Y\) describes the probability distribution of their combined outcomes. It can be easily represented by the joint cumulative distribution function (CDF) defined as: \[ F_{X,Y}(x,y) = P(X \leq x, Y \leq y) \] for any real numbers \(x\) and \(y\). This definition is irrespective of whether the random variables are discrete or continuous.

The joint distribution can also be represented by the joint probability mass function (PMF) for discrete random variables or the joint probability density function (PDF) for continuous random variables.

For two discrete random variables \(X\) and \(Y\), the joint PMF is defined as: \[ f_{X,Y}(x,y) = P(X = x, Y = y) \]

for all possible joint values \((x, y)\) in the image set \((X,Y)(\Omega)\).

For two continuous random variables \(X\) and \(Y\), the joint PDF can be defined if there exists a function \(f_{X,Y}(x,y)\) such that for any subset \(A\subset \mathbb{R}^2\): \[ P((X,Y)\in A) = \iint_A f_{X,Y}(x,y) \, dx \, dy \]

The joint PDF can also be obtained as the mixed partial derivative of the joint cumulative distribution function (CDF), wherever this derivative exists: \[ f_{X,Y}(x,y) = \frac{\partial^2}{\partial x \partial y} P(X \leq x, Y \leq y) \]

The joint PDF satisfies:

  • \(f_{X,Y}(x,y) \ge 0\) for all \(x, y \in \mathbb{R}\).
  • \(\rule{0in}{4ex}\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dx \, dy = 1\).
NoteMarginal Distributions

Definition 1.24 The marginal distribution of a random variable is the probability distribution of that variable when considered independently of other variables. It is obtained by summing (for discrete variables) or integrating (for continuous variables) the joint distribution over the values of the other variables. For two discrete random variables \(X\) and \(Y\) with joint PMF \(f_{X,Y}(x,y)\), the marginal PMFs are given by: \[ f_X(x) = \sum_{y} f_{X,Y}(x,y) \] \[ f_Y(y) = \sum_{x} f_{X,Y}(x,y) \] For two continuous random variables \(X\) and \(Y\) with joint PDF \(f_{X,Y}(x,y)\), the marginal PDFs are given by: \[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dy \] \[ f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dx \]

We can define marginal CDFs in a similar manner. For two random variables \(X\) and \(Y\) with joint CDF \(F_{X,Y}(x,y)\), the marginal CDFs are given by: \[ F_X(x) = \lim_{y \to \infty} F_{X,Y}(x,y) \] \[ F_Y(y) = \lim_{x \to \infty} F_{X,Y}(x,y) \]

NoteJoint Distribution in the case of two coin flips

Example 1.23 When flipping a coin twice, we can define two random variables:

  • \(X_1\): outcome of the first flip (1 for heads, 0 for tails)
  • \(X_2\): outcome of the second flip (1 for heads, 0 for tails)

Following from Example 1.11, the joint distribution of \(X_1\) and \(X_2\) can be represented in a table:

| \(X_1 \backslash X_2\) | 0 (Tails) | 1 (Heads) |
|---|---|---|
| 0 (Tails) | 0.1 | 0.4 |
| 1 (Heads) | 0.2 | 0.3 |

The joint PMF is given by:

  • \(f_{X_1,X_2}(0,0)=P(X_1=0, X_2=0) = P(\{TT\}) = 0.1\)
  • \(f_{X_1,X_2}(0,1)=P(X_1=0, X_2=1) = P(\{TH\}) = 0.4\)
  • \(f_{X_1,X_2}(1,0)=P(X_1=1, X_2=0) = P(\{HT\}) = 0.2\)
  • \(f_{X_1,X_2}(1,1)=P(X_1=1, X_2=1) = P(\{HH\}) = 0.3\)

The marginal distributions of \(X_1\) and \(X_2\) can be obtained by summing over the rows and columns respectively:

\[ f_{X_1}(x_1) = \sum_{x_2\in\{0,1\}} f_{X_1,X_2}(x_1,x_2) =\begin{cases} f_{X_1,X_2}(0,0) + f_{X_1,X_2}(0,1) = 0.1 + 0.4=0.5 & \text{if } x_1 = 0 \\ f_{X_1,X_2}(1,0) + f_{X_1,X_2}(1,1) = 0.2 + 0.3=0.5 & \text{if } x_1 = 1 \\ 0 & \text{otherwise} \end{cases} \]

\[ f_{X_2}(x_2)= \sum_{x_1\in\{0,1\}} f_{X_1,X_2}(x_1,x_2) =\begin{cases} f_{X_1,X_2}(0,0) + f_{X_1,X_2}(1,0) = 0.1 + 0.2 =0.3 & \text{if } x_2 = 0 \\ f_{X_1,X_2}(0,1) + f_{X_1,X_2}(1,1) = 0.4 + 0.3 = 0.7 & \text{if } x_2 = 1 \\ 0 & \text{otherwise} \end{cases} \]

Now let \(Z\) be another random variable, the indicator that the two flips are the same, that is \(Z=1\) if \(X_1=X_2\) and \(Z=0\) otherwise. The joint distribution of \(X_1\) and \(Z\) can be represented in a table:

| \(X_1 \backslash Z\) | 0 (Different) | 1 (Same) |
|---|---|---|
| 0 (Tails) | 0.4 | 0.1 |
| 1 (Heads) | 0.2 | 0.3 |

The joint PMF is given by:

  • \(f_{X_1,Z}(0,0)=P(X_1=0, Z=0) = P(\{TH\}) = 0.4\)
  • \(f_{X_1,Z}(0,1)=P(X_1=0, Z=1) = P(\{TT\}) = 0.1\)
  • \(f_{X_1,Z}(1,0)=P(X_1=1, Z=0) = P(\{HT\}) = 0.2\)
  • \(f_{X_1,Z}(1,1)=P(X_1=1, Z=1) = P(\{HH\}) = 0.3\)

The marginal distributions of \(X_1\) and \(Z\) can be obtained by summing over the rows and columns respectively:

\[ f_{X_1}(x_1) = \sum_{z\in\{0,1\}} f_{X_1,Z}(x_1,z)= \begin{cases} f_{X_1,Z}(0,0) + f_{X_1,Z}(0,1) = 0.4 + 0.1=0.5 & \text{if } x_1 = 0 \\ f_{X_1,Z}(1,0) + f_{X_1,Z}(1,1) = 0.2 + 0.3=0.5 & \text{if } x_1 = 1 \\ 0 & \text{otherwise} \end{cases} \]

\[ f_Z(z)= \sum_{x_1\in\{0,1\}} f_{X_1,Z}(x_1,z)= \begin{cases} f_{X_1,Z}(0,0) + f_{X_1,Z}(1,0) = 0.4 + 0.2 =0.6 & \text{if } z = 0 \\ f_{X_1,Z}(0,1) + f_{X_1,Z}(1,1) = 0.1 + 0.3 = 0.4 & \text{if } z = 1 \\ 0 & \text{otherwise} \end{cases} \]
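The tables in this example can be reproduced by treating each random variable as a function on the sample space and pushing the outcome probabilities through it. The sketch below is one way to do this; the helper names (`joint_pmf`, `marginals`, and so on) are ours and purely illustrative.

```python
from collections import defaultdict

# Probabilities of the four outcomes used in Example 1.23
outcome_probs = {"TT": 0.1, "TH": 0.4, "HT": 0.2, "HH": 0.3}

def x1(w):
    """First flip: 1 for heads, 0 for tails."""
    return 1 if w[0] == "H" else 0

def x2(w):
    """Second flip: 1 for heads, 0 for tails."""
    return 1 if w[1] == "H" else 0

def z(w):
    """Indicator that both flips are the same."""
    return 1 if w[0] == w[1] else 0

def joint_pmf(u, v):
    """Joint PMF of (u, v): add the probability of each outcome to the cell it maps to."""
    table = defaultdict(float)
    for w, prob in outcome_probs.items():
        table[(u(w), v(w))] += prob
    return dict(table)

def marginals(joint):
    """Sum the joint PMF over the other variable to get both marginal PMFs."""
    first, second = defaultdict(float), defaultdict(float)
    for (a, b), prob in joint.items():
        first[a] += prob
        second[b] += prob
    return dict(first), dict(second)

joint_x1_x2 = joint_pmf(x1, x2)
joint_x1_z = joint_pmf(x1, z)
f_x1, f_x2 = marginals(joint_x1_x2)
_, f_z = marginals(joint_x1_z)
print(joint_x1_z)
print(f_x1, f_x2, f_z)
```

The marginal of \(X_1\) is the same whichever joint table it is computed from, as the two derivations in the example show.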

TipExamples of joint continuous Distribution

Example 1.24 Consider two continuous random variables \(X\) and \(Y\) with the joint PDF given by: \[ f_{X,Y}(x,y) = \begin{cases} 2 & \text{if } 0 < x < 1 \text{ and } 0 < y < x \\ 0 & \text{otherwise} \end{cases} \]

We can verify that this is a valid joint PDF by checking that it is non-negative and integrates to 1 over the entire plane: \[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dy \, dx = \int_0^1 \int_0^x 2 \, dy \, dx = \int_0^1 2x \, dx = 1 \] The support of the joint distribution is the triangular region in the \(xy\)-plane where \(0 < x < 1\) and \(0 < y < x\).

The marginal distributions can be computed as follows: \[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dy = \int_0^x 2 \, dy = 2x \quad \text{for } 0 < x < 1 \] \[ f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dx = \int_y^1 2 \, dx = 2(1 - y) \quad \text{for } 0 < y < 1 \]

We can also obtain the joint cumulative distribution function (CDF) as follows. For any \(0<y<x<1\) \[ \begin{aligned} F_{X,Y}(x,y) &= P(X \leq x, Y \leq y)\\ & = \int_{-\infty}^x \int_{-\infty}^y f_{X,Y}(u,v) \, dv \, du\\ &=\int_{0}^y \int_{0}^u f_{X,Y}(u,v) \, dv \, du+\int_{y}^x \int_{0}^y f_{X,Y}(u,v) \, dv \, du\\ &=\int_{0}^y \int_{0}^u 2 \, dv \, du+\int_{y}^x \int_{0}^y 2 \, dv \, du\\ &= \int_{0}^y 2u \, du+\int_{y}^x 2y \, du\\ &= y^2 + 2y(x-y) \\ \end{aligned} \]

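The marginal CDFs can be computed as follows. Since the joint density vanishes when \(y \geq x\), the joint CDF stops growing once \(y\) reaches \(x\), so the limit equals the value at \(y=x\):

\[ F_X(x) = \lim_{y \to \infty} F_{X,Y}(x,y) = F_{X,Y}(x,x) = x^2 + 2x(x-x) = x^2 \quad \text{for } 0 < x < 1 \]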

\[ F_Y(y) = \lim_{x \to \infty} F_{X,Y}(x,y) = F_{X,Y}(1,y) = y^2 + 2y(1-y) = 2y - y^2 \quad \text{for } 0 < y < 1 \]

or, alternatively, we can compute the marginal CDFs directly from the marginal PDFs as follows:

\[ F_X(x) = \int_{-\infty}^x f_X(u) \, du = \int_0^x 2u \, du = x^2 \quad \text{for } 0 < x < 1 \] \[ F_Y(y) = \int_{-\infty}^y f_Y(v) \, dv = \int_0^y 2(1 - v) \, dv = 2y - y^2 \quad \text{for } 0 < y < 1 \]
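The closed-form joint CDF \(y^2 + 2y(x-y)\) can be checked against a direct numerical double integral of the density. This is a rough sketch with a simple midpoint rule; the function names, the grid size and the test point \((0.8, 0.5)\) are illustrative assumptions.

```python
def joint_pdf(x, y):
    """Density equal to 2 on the triangle 0 < y < x < 1 and 0 elsewhere."""
    return 2.0 if 0 < x < 1 and 0 < y < x else 0.0

def joint_cdf_numeric(x, y, n=400):
    """Midpoint-rule approximation of the integral of the density over (0, x) x (0, y).

    The density is zero for negative arguments, so starting at 0 loses nothing.
    """
    du, dv = x / n, y / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * du
        for j in range(n):
            v = (j + 0.5) * dv
            total += joint_pdf(u, v)
    return total * du * dv

def joint_cdf_closed(x, y):
    """Closed form from the text, valid for 0 < y < x < 1."""
    return y**2 + 2 * y * (x - y)

x, y = 0.8, 0.5  # illustrative point with 0 < y < x < 1
numeric = joint_cdf_numeric(x, y)
closed = joint_cdf_closed(x, y)
print(round(numeric, 3), round(closed, 3))
```

The two values agree only approximately because the density jumps along the line \(y = x\), which the grid resolves with limited accuracy.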

Now consider another joint PDF given by:

\[ f_{X,Y}(x,y) = \begin{cases} 6xy^2 & \text{if } 0 < x < 1 \text{ and } 0 < y < 1 \\ 0 & \text{otherwise} \end{cases} \] We can verify that this is a valid joint PDF by checking that it is non-negative and integrates to 1 over the entire plane: \[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dy \, dx = \int_0^1 \int_0^1 6xy^2 \, dy \, dx = \int_0^1 2x \, dx = 1 \] The support of the joint distribution is the unit square in the \(xy\)-plane where \(0 < x < 1\) and \(0 < y < 1\).

The marginal distributions can be computed as follows:

\[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dy = \int_0^1 6xy^2 \, dy = 2x \quad \text{for } 0 < x < 1 \] \[ f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dx = \int_0^1 6xy^2 \, dx = 3y^2 \quad \text{for } 0 < y < 1 \] Then we have

\[ f_{X,Y}(x,y)=f_X(x)f_Y(y) \]

so the random variables \(X\) and \(Y\) are independent. This also implies that the joint cumulative distribution function (CDF) is given by: \[ F_{X,Y}(x,y) = F_X(x) F_Y(y) \] where \(F_X(x)\) and \(F_Y(y)\) are the marginal CDFs of \(X\) and \(Y\), respectively. We can compute these marginal CDFs as follows: \[ F_X(x) = \int_{-\infty}^x f_X(u) \, du = \int_0^x 2u \, du = x^2 \quad \text{for } 0 < x < 1 \] \[ F_Y(y) = \int_{-\infty}^y f_Y(v) \, dv = \int_0^y 3v^2 \, dv = y^3 \quad \text{for } 0 < y < 1 \] Therefore, the joint CDF is given by: \[ F_{X,Y}(x,y) = F_X(x) F_Y(y) = x^2 y^3 \quad \text{for } 0 < x < 1 \text{ and } 0 < y < 1 \]

NoteIndependent Random Variables

Definition 1.25 Two discrete random variables \(X\) and \(Y\) are independent if for all \(x\) and \(y\) in their respective image sets: \[ P(X = x, Y = y) = P(X = x) \times P(Y = y) \] Equivalently, if either of the following conditions holds for all \(x\) and \(y\) for which the conditional probability is defined:

  • \(P(X = x | Y = y) = P(X = x)\)
  • \(P(Y = y | X = x) = P(Y = y)\)

For the case of continuous random variables, \(X\) and \(Y\) are independent if for all \(x\) and \(y\) in their respective image sets: \[ f_{X,Y}(x, y) = f_X(x) \times f_Y(y) \]

where \(f_{X,Y}(x, y)\) is the joint probability density function of \(X\) and \(Y\), and \(f_X(x)\) and \(f_Y(y)\) are the marginal probability density functions of \(X\) and \(Y\), respectively.

NoteExample of Independent Random Variables

Example 1.25 We can check if \(X_1\) and \(X_2\) defined in Example 1.23 are independent random variables. We have: \[ P(X_1=0, X_2=0) = 0.1 \neq P(X_1=0)P(X_2=0) = 0.5\times 0.3 = 0.15 \] so they are not independent.

We can check if \(X_1\) and \(Z\) defined in Example 1.23 are independent random variables. We have: \[ P(X_1=0, Z=0) = 0.4 \neq P(X_1=0)P(Z=0) = 0.5\times 0.6 = 0.3 \] so they are not independent either.
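The factorization check can be automated for any finite joint table. The sketch below tests every cell of the joint PMF against the product of the marginals; the function name `is_independent` and the third table (two fair, independent flips) are illustrative assumptions added for contrast.

```python
from itertools import product
from math import isclose

# Joint PMFs from Example 1.23, keyed by (first value, second value)
joint_x1_x2 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.2, (1, 1): 0.3}
joint_x1_z = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
# Hypothetical table for two fair, independent coin flips
joint_fair = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

def is_independent(joint):
    """Return True if every cell equals the product of its two marginal probabilities."""
    xs = sorted({a for a, _ in joint})
    ys = sorted({b for _, b in joint})
    f_x = {a: sum(joint.get((a, b), 0.0) for b in ys) for a in xs}
    f_y = {b: sum(joint.get((a, b), 0.0) for a in xs) for b in ys}
    return all(
        isclose(joint.get((a, b), 0.0), f_x[a] * f_y[b], abs_tol=1e-12)
        for a, b in product(xs, ys)
    )

print(is_independent(joint_x1_x2))
print(is_independent(joint_x1_z))
print(is_independent(joint_fair))
```

A single cell that fails the product rule is enough to rule out independence, whereas establishing independence requires every cell to pass.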

1.7 Conditional distributions

NoteConditional Distribution of two random variables

Definition 1.26 The conditional distribution of a random variable \(X\) given another random variable \(Y\) describes the probability distribution of \(X\) when the value of \(Y\) is known. It is represented by the conditional probability mass function (PMF) for discrete random variables or the conditional probability density function (PDF) for continuous random variables. For two discrete random variables \(X\) and \(Y\), the conditional PMF of \(X\) given \(Y=y\) is defined as: \[ f_{X|Y}(x|y) = P(X = x | Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{f_{X,Y}(x,y)}{f_Y(y)} \]

for all possible values \(x\) in the image set of \(X\) and for all \(y\) such that \(P(Y = y) > 0\). For two continuous random variables \(X\) and \(Y\), the conditional PDF of \(X\) given \(Y=y\) is defined as: \[ f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} \] for all possible values \(x\) in the image set of \(X\) and for all \(y\) such that \(f_Y(y) > 0\).

NoteExample of Discrete Conditional Distribution

Example 1.26 We can compute the conditional distribution of \(X_1\) given \(Z\) in Example 1.23. We have: \[ f_{X_1|Z}(x|z) = \frac{f_{X_1,Z}(x,z)}{f_Z(z)} \] for all possible values \(x\) in the image set of \(X_1\) and for all \(z\) such that \(f_Z(z) > 0\). We have:

  • \(f_{X_1|Z}(0|0) = \frac{f_{X_1,Z}(0,0)}{f_Z(0)} = \frac{0.4}{0.6} = \frac{2}{3}\)
  • \(f_{X_1|Z}(1|0) = \frac{f_{X_1,Z}(1,0)}{f_Z(0)} = \frac{0.2}{0.6} = \frac{1}{3}\)

the other conditional PMF is:

  • \(f_{X_1|Z}(0|1) = \frac{f_{X_1,Z}(0,1)}{f_Z(1)} = \frac{0.1}{0.4} = \frac{1}{4}\)
  • \(f_{X_1|Z}(1|1) = \frac{f_{X_1,Z}(1,1)}{f_Z(1)} = \frac{0.3}{0.4} = \frac{3}{4}\)
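The conditional PMFs above amount to renormalizing one column of the joint table. The following sketch does this with exact fractions so the results can be compared directly with \(2/3\), \(1/3\), \(1/4\) and \(3/4\); the function name is an illustrative assumption.

```python
from fractions import Fraction

# Joint PMF of (X_1, Z) from Example 1.23, stored as exact fractions
joint_x1_z = {
    (0, 0): Fraction(4, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(3, 10),
}

def conditional_pmf_given_z(joint, z):
    """Return the conditional PMF of the first variable given that the second equals z."""
    f_z = sum(prob for (_, b), prob in joint.items() if b == z)  # marginal of Z at z
    if f_z == 0:
        raise ValueError("conditioning on an event of probability zero")
    return {a: prob / f_z for (a, b), prob in joint.items() if b == z}

given_different = conditional_pmf_given_z(joint_x1_z, 0)
given_same = conditional_pmf_given_z(joint_x1_z, 1)
print(given_different)
print(given_same)
```

Each conditional PMF sums to 1, as any PMF must, because dividing by \(f_Z(z)\) rescales the column to total probability 1.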
TipExample of Continuous Conditional Distribution

Example 1.27 Consider the joint PDF given in Example 1.24: \[ f_{X,Y}(x,y) = \begin{cases} 2 & \text{if } 0 < x < 1 \text{ and } 0 < y < x \\ 0 & \text{otherwise} \end{cases} \] We can compute the conditional distribution of \(Y\) given \(X=x\). We have: \[ f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)} \] for all possible values \(y\) in the image set of \(Y\) and for all \(x\) such that \(f_X(x) > 0\). We have: \[ f_{Y|X}(y|x) = \frac{2}{2x} = \frac{1}{x} \quad \text{for } 0 < y < x \] and 0 otherwise.

The other conditional distribution is: \[ f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} \] for all possible values \(x\) in the image set of \(X\) and for all \(y\) such that \(f_Y(y) > 0\). We have: \[ f_{X|Y}(x|y) = \frac{2}{2(1-y)} = \frac{1}{1-y} \quad \text{for } y < x < 1 \] and 0 otherwise.

1.8 Moments, Variance, covariance and correlation

NoteExpectation and Variance of a Random Variable

Definition 1.27 The expectation, expected value or mean of a random variable \(X\), denoted by \(E[X]\) or \(\mu_X\), is defined as:

  • For a discrete random variable: \[ E[X] = \sum_{x \in X(\Omega)} x \cdot P(X = x) = \sum_{x \in X(\Omega)} x \cdot f_X(x) \]
  • For a continuous random variable: \[ E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) \, dx \]

The variance of a random variable \(X\), denoted by \(Var(X)\) is defined as: \[ Var(X) = E[(X - E[X])^2] \]

It is easy to show that:

\[ Var(X) = E[X^2] - (E[X])^2 \]

The standard deviation of \(X\), is the square root of the variance: \(\sigma_X = \sqrt{Var(X)}\)

The variance measures the spread or dispersion of the random variable around its mean. A higher variance indicates that the values of the random variable are more spread out, while a lower variance indicates that they are more concentrated around the mean.
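As a worked instance of these definitions, the sketch below computes the mean and variance of the discrete random variable from Example 1.21, using both the defining formula and the shortcut \(E[X^2] - (E[X])^2\). Variable names are illustrative.

```python
from math import sqrt

# PMF of Example 1.21
support = [0, 1, 2, 3]
pmf = [0.1, 0.3, 0.4, 0.2]

# E[X]: probability-weighted average of the values
mean = sum(x * p for x, p in zip(support, pmf))

# Var(X) from the definition E[(X - E[X])^2]
var_definition = sum((x - mean) ** 2 * p for x, p in zip(support, pmf))

# Var(X) from the shortcut E[X^2] - (E[X])^2
second_moment = sum(x**2 * p for x, p in zip(support, pmf))
var_shortcut = second_moment - mean**2

std = sqrt(var_definition)
print(round(mean, 4), round(var_definition, 4), round(var_shortcut, 4), round(std, 4))
```

Both variance computations agree up to floating-point rounding, which is the identity \(Var(X) = E[X^2] - (E[X])^2\) in action.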

NoteExamples of Expectation and Variance

Example 1.28  

  • For a discrete uniform random variable with parameter \(k\):
    • \(E[X] = \frac{k + 1}{2}\)
    • \(Var(X) = \frac{k^2 - 1}{12}\)
  • For a Bernoulli random variable with parameter \(p\):
    • \(E[X] = p\)
    • \(Var(X) = p(1 - p)\)
  • For a binomial random variable with parameters \(n\) and \(p\):
    • \(E[X] = np\)
    • \(Var(X) = np(1 - p)\)
  • For a hypergeometric random variable with parameters \(n\), \(m\) and \(k\):
    • \(E[X] = k\frac{m}{n}\)
    • \(Var(X) = k\frac{m}{n}\frac{n-m}{n}\frac{n-k}{n-1}\)
  • For a Poisson random variable with parameter \(\lambda\):
    • \(E[X] = \lambda\)
    • \(Var(X) = \lambda\)
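  • For the negative binomial random variable with parameters \(r\) and \(p\), where \(X\) is the number of trials needed to achieve \(r\) successes and \(Y = X - r\) is the number of failures before the \(r\)-th success:
    • \(E[X] = \frac{r}{p}\)
    • \(Var(X) = \frac{r(1 - p)}{p^2}\)
    • \(E[Y] = \frac{r(1-p)}{p}\)
    • \(Var(Y) = \frac{r(1 - p)}{p^2}\)
  • For a continuous uniform random variable on the interval \((a, b)\):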
    • \(E[X] = \frac{a + b}{2}\)
    • \(Var(X) = \frac{(b - a)^2}{12}\)
  • For a normal random variable with mean \(\mu\) and standard deviation \(\sigma\):
    • \(E[X] = \mu\)
    • \(Var(X) = \sigma^2\)
  • For an exponential random variable with rate parameter \(\lambda\):
    • \(E[X] = \frac{1}{\lambda}\)
    • \(Var(X) = \frac{1}{\lambda^2}\)

These can be derived from the definitions above and the corresponding probability mass or density functions.
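Some of these formulas can be confirmed exactly by summing over the PMF with rational arithmetic. The sketch below does this for one binomial and one hypergeometric case; the helper name and all parameter values are illustrative assumptions, and `pop_size` plays the role of \(n\) in the hypergeometric notation of the text.

```python
from fractions import Fraction
from math import comb

def mean_and_variance(pmf):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, variance

# Binomial with n_trials trials and success probability p
n_trials, p = 6, Fraction(1, 3)
binomial = {
    y: comb(n_trials, y) * p**y * (1 - p) ** (n_trials - y)
    for y in range(n_trials + 1)
}

# Hypergeometric: pop_size balls, m red, k drawn without replacement
pop_size, m, k = 20, 7, 5
hypergeometric = {
    y: Fraction(comb(m, y) * comb(pop_size - m, k - y), comb(pop_size, k))
    for y in range(max(0, k - (pop_size - m)), min(m, k) + 1)
}

binom_mean, binom_var = mean_and_variance(binomial)
hyper_mean, hyper_var = mean_and_variance(hypergeometric)
print(binom_mean, binom_var)
print(hyper_mean, hyper_var)
```

The hypergeometric variance is smaller than the corresponding binomial variance \(k\frac{m}{n}\frac{n-m}{n}\) by the factor \(\frac{n-k}{n-1}\), which reflects sampling without replacement.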

TipExpected values of functions of random variables

Definition 1.28 The expected value of a function \(g(X)\) of a random variable \(X\) is given by:

  • For a discrete random variable: \[ E[g(X)] = \sum_{x \in X(\Omega)} g(x) \cdot P(X = x) = \sum_{x \in X(\Omega)} g(x) \cdot f_X(x) \]
  • For a continuous random variable: \[ E[g(X)] = \int_{-\infty}^{\infty} g(x) \cdot f_X(x) \, dx \]

For joint random variables \(X\) and \(Y\), the expected value of a function \(g: \mathbb{R}^2\to \mathbb{R}\) is given by:

  • For discrete random variables: \[ \begin{aligned} E[g(X, Y)] &= \sum_{(x,y) \in (X,Y)(\Omega)} \sum\, g(x, y) \cdot P(X = x, Y = y)\\ & = \sum_{(x,y) \in (X,Y)(\Omega)} \sum\, g(x, y) \cdot f_{X,Y}(x,y) \end{aligned} \]
  • For continuous random variables: \[ E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) \cdot f_{X,Y}(x,y) \, dx \, dy \]
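
To make the discrete formulas concrete, the following sketch evaluates \(E[g(X)]\) for a fair six-sided die with \(g(x) = x^2\), and \(E[g(X, Y)]\) for two independent fair dice with \(g(x, y) = \max(x, y)\). The dice and the choices of \(g\) are illustrative assumptions, not examples from the text.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p_face = Fraction(1, 6)          # pmf of a fair die: f_X(x) = 1/6

# E[g(X)] = sum over x of g(x) * f_X(x), with g(x) = x^2
e_x_squared = sum(x**2 * p_face for x in faces)

# E[g(X, Y)] = sum over (x, y) of g(x, y) * f_{X,Y}(x, y), with g = max.
# For independent dice the joint pmf is f_{X,Y}(x, y) = 1/36.
e_max = sum(max(x, y) * p_face * p_face for x, y in product(faces, faces))

print(e_x_squared)   # 91/6
print(e_max)         # 161/36
```

Note that \(E[X^2] = 91/6 \approx 15.17\) differs from \((E[X])^2 = 12.25\), so in general \(E[g(X)] \neq g(E[X])\).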
NoteHigher-order moments and moment generating function

Definition 1.29 The \(n\)-th moment of a random variable \(X\) is defined as: \[ E[X^n] = \begin{cases} \displaystyle \sum_{x \in X(\Omega)} x^n \cdot P(X = x) & \text{if } X \text{ is discrete} \\ \rule{0in}{4ex} \displaystyle\int_{-\infty}^{\infty} x^n \cdot f_X (x) \, dx & \text{if } X \text{ is continuous} \end{cases} \] The moment generating function (MGF) of a random variable \(X\) is defined as: \[ M_X(t) = E[e^{tX}] = \begin{cases} \displaystyle\sum_{x \in X(\Omega)} e^{tx} \cdot P(X = x) & \text{if } X \text{ is discrete} \\ \rule{0in}{4ex} \displaystyle \int_{-\infty}^{\infty} e^{tx} \cdot f _X(x) \, dx & \text{if } X \text{ is continuous} \end{cases} \] The MGF can be used to compute the moments of a random variable. The \(n\) -th moment of \(X\) can be obtained by taking the \(n\)-th derivative of the MGF and evaluating it at \(t=0\): \[ E[X^n] = M_X^{(n)}(0) = \left. \frac{d^n}{dt^n} M_X(t) \right|_{t=0} \]

Notes:

  • We are assuming here that the MGF exists in a neighborhood of \(t=0\).
  • The derivative formula above does not depend on whether \(X\) is discrete or continuous.
  • The first moment is the mean: \(E[X] = M_X'(0)\)
  • The second moment is \(E[X^2] = M_X''(0)\)
  • The variance can be computed as: \(Var(X) = M_X''(0) - (M_X'(0))^2\)
  • The MGF uniquely determines the distribution of a random variable, if it exists in a neighborhood of \(t=0\).
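
A brief justification of the derivative formula, sketched under the assumption that the MGF is finite in a neighborhood of \(t=0\) so that expectation and summation can be interchanged: expanding the exponential as a power series gives

\[ M_X(t) = E\left[e^{tX}\right] = E\left[\sum_{n=0}^{\infty} \frac{(tX)^n}{n!}\right] = \sum_{n=0}^{\infty} \frac{t^n}{n!} E[X^n] \]

Differentiating \(n\) times with respect to \(t\) and setting \(t=0\) removes every term except the one involving \(E[X^n]\), which yields \(M_X^{(n)}(0) = E[X^n]\).
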
NoteBernoulli MGF and Moments

Example 1.29 For a Bernoulli random variable with parameter \(p\), the MGF is given by \[M_X(t) = 1 - p + pe^t\]

Differentiating and evaluating at \(t=0\), we have:

  • \(M_X'(t) = pe^t\) so \(E[X] = M _X'(0) = p\)
  • \(M_X''(t) = pe^t\) so \(E[X^2 ] = M_X''(0) = p\)
  • \(Var(X) = M_X''(0) - (M_X'(0)) ^2 = p - p^2 = p(1-p)\)
NotePoisson MGF and Moments

Example 1.30 For a Poisson random variable with parameter \(\lambda\), the MGF is given by \[M_X(t) = e^{\lambda(e^t - 1)}\]

Differentiating and evaluating at \(t=0\), we have:

  • \(M_X'(t) = \lambda e^t e^{\lambda(e^t - 1)}\) so \(E[X] = M_X'(0) = \lambda\)
  • \(M_X''(t) = \lambda e^t e^{\lambda(e^t - 1)} + \lambda^2 e^{2t} e^{\lambda(e^t - 1)}\) so \(E[X^2] = M_X''(0) = \lambda + \lambda^2\)
  • \(Var(X) = M_X''(0) - (M_X'(0))^2 = \lambda + \lambda^2 - \lambda^2 = \lambda\)

NoteMGF and Moments example

Example 1.31 A random variable \(X\) has the following MGF: \[M_X(t) = \frac{3}{3 - t}\,,\qquad t<3\]

We can obtain the moments without knowledge of the PDF or CDF as follows:

  • \(M_X'(t) = \frac{3}{(3 - t)^2}\) so \(E[X] = M_X'(0) = \frac{1}{3}\)
  • \(M_X''(t) = \frac{6}{(3 - t)^3}\) so \(E[X^2] = M_X''(0) = \frac{ 6}{27}\)
  • \(Var(X) = M_X''(0) - (M_X'(0)) ^2 = \frac{6}{27} - \frac{1}{9} =\frac{6}{27} - \frac{3}{27}=\frac{1}{9}\)

Now, let \(X\) be an exponential random variable with rate parameter \(\lambda\). The MGF is given by \[M_X(t) = \int_0^\infty e^{tx} \lambda e^{-\lambda x} \, dx = \frac{\lambda}{\lambda - t}\] for \(t < \lambda\). Since the MGF uniquely determines the distribution, the only distribution with the MGF given above is the exponential distribution with rate parameter \(\lambda = 3\).
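
The moment calculations in this example can be cross-checked numerically. The sketch below approximates \(M_X'(0)\) and \(M_X''(0)\) with central finite differences and compares the results with sample moments from simulated exponential draws with rate 3. The step size `h`, the seed, and the sample size are arbitrary choices made for illustration.

```python
import numpy as np

def mgf(t):
    """MGF from the example: M_X(t) = 3 / (3 - t), valid for t < 3."""
    return 3.0 / (3.0 - t)

h = 1e-4  # finite-difference step (assumed small relative to the domain t < 3)

# Central differences approximate the first two derivatives at t = 0
first_moment = (mgf(h) - mgf(-h)) / (2 * h)
second_moment = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h**2
variance = second_moment - first_moment**2

# Simulated Exponential(rate = 3) draws; NumPy parameterizes by scale = 1 / rate
rng = np.random.default_rng(0)
samples = rng.exponential(scale=1 / 3, size=200_000)

print("MGF-based mean, variance:", first_moment, variance)
print("Sample mean, variance:   ", samples.mean(), samples.var())
```

Both lines should be close to the exact values \(1/3\) and \(1/9\), with the simulated values differing by sampling error.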

NoteExpectation and Independence

Proposition 1.5 If \(X\) and \(Y\) are independent random variables, then:

  • \(E[XY] = E[X]E[Y]\)

  • More generally, if \(g\) and \(h\) are functions, then \(E[g(X)h(Y)] = E[g(X)]E[h(Y)]\)

  • In particular,

\[ M_{X+Y}(t) =E[e^{t(X+Y)}] = E[e^{tX}e^{tY}] = E[e^{tX}]E[e^{tY}] = M_X(t)M_Y(t) \]

This means that the MGF of a sum of independent random variables is the product of their MGFs.

TipSum of Bernoulli Random Variables

Example 1.32 Let \(X_1, X_2, \ldots, X_n\) be independent Bernoulli random variables with parameter \(p\). Let \(S_n = X_1 + X_2 + \cdots + X_n\) be their sum. We can compute the MGF of \(S_n\) as follows: \[ M_{S_n}(t) = E[e^{tS_n}] = E[e^{t(X_1 + X_2 + \cdots + X_n)}] = E[e^{tX_1}e^{tX_2}\cdots e^{tX_n}] = E[e^{tX_1}]E[e^{tX_2}]\cdots E[e^{tX_n}] = (M_{X_1}(t))^n \] where we used the independence of the \(X_i\)’s. Since each \(X_i\) is a Bernoulli random variable with parameter \(p\), we have: \[ M_{X_i}(t) = 1 - p + pe^t \] for all \(i\). Therefore, we have: \[ M_{S_n}(t) = (1 - p + pe^t)^n \] which is the MGF of a Binomial random variable with parameters \(n\) and \(p\). Therefore, we conclude that \(S_n\) follows a Binomial distribution with parameters \(n\) and \(p\).
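
The conclusion can also be checked without MGFs. The sketch below relies on the standard fact that the pmf of a sum of independent discrete random variables is the convolution of their pmfs, and compares the result with the binomial pmf. The values of `n` and `p` are illustrative.

```python
import numpy as np
from math import comb

n, p = 8, 0.3
bernoulli_pmf = np.array([1 - p, p])   # P(X_i = 0), P(X_i = 1)

# Convolve the Bernoulli pmf with itself n times to get the pmf of S_n
sum_pmf = np.array([1.0])              # pmf of the empty sum (point mass at 0)
for _ in range(n):
    sum_pmf = np.convolve(sum_pmf, bernoulli_pmf)

# Binomial(n, p) pmf evaluated at 0, 1, ..., n
binomial_pmf = np.array(
    [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
)

# Largest absolute difference; expected to be at floating-point level
print(np.max(np.abs(sum_pmf - binomial_pmf)))
```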

TipSum of Poisson Random Variables

Example 1.33 Let \(X_1, X_2, \ldots, X_n\) be independent Poisson random variables with parameters \(\lambda_1, \lambda_2, \ldots, \lambda_n\), respectively. Let \(S_n = X_1 + X_2 + \cdots + X_n\) be their sum. We can compute the MGF of \(S_n\) as follows: \[ M_{S_n}(t) = E[e^{tS_n}] = E[e^{t(X_1 + X_2 + \cdots + X_n)}] = E[e^{tX_1}e^{tX_2}\cdots e^{tX_n}] = E[e^{tX_1}]E[e^{tX_2}]\cdots E[e^{tX_n}] = \prod_{i=1}^n M_{X_i}(t) \] where we used the independence of the \(X_i\)’s. Since each \(X_i\) is a Poisson random variable with parameter \(\lambda_i\), we have: \[ M_{X_i}(t) = e^{\lambda_i(e^t - 1)} \] for all \(i\). Therefore, we have: \[ M_{S_n}(t) = \prod_{i=1}^n e^{\lambda_i(e^t - 1)} = e^{(\sum_{i=1}^n \lambda_i)(e^t - 1)} \] which is the MGF of a Poisson random variable with parameter \(\sum_{i=1}^n \lambda_i\). Therefore, we conclude that \(S_n\) follows a Poisson distribution with parameter \(\sum_{i=1}^n \lambda_i\).
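
A simulation gives a rough empirical check of this result. The sketch below sums independent Poisson draws and compares the sample mean, the sample variance, and \(P(S_n = 0)\) with the values expected for a Poisson random variable with parameter \(\sum_i \lambda_i\). The rates, the seed, and the sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
lambdas = np.array([0.5, 1.5, 2.0])    # illustrative rates, sum = 4.0
n_draws = 200_000

# Each row is one draw of (X_1, X_2, X_3); summing rows gives draws of S_n
draws = rng.poisson(lam=lambdas, size=(n_draws, lambdas.size))
s = draws.sum(axis=1)

# A Poisson(4) variable has mean 4 and variance 4
print("Sample mean of S_n:    ", s.mean())
print("Sample variance of S_n:", s.var())
# P(S_n = 0) should be close to exp(-4)
print("P(S_n = 0):", np.mean(s == 0), "vs", np.exp(-lambdas.sum()))
```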

TipSum of Exponential Random Variables

Example 1.34 Let \(X_1, X_2, \ldots, X_n\) be independent exponential random variables with rate parameter \(\lambda\). Let \(S_n = X_1 + X_2 + \cdots + X_n\) be their sum. We can compute the MGF of \(S_n\) as follows: \[ M_{S_n}(t) = E[e^{tS_n}] = E[e^{t(X_1 + X_2 + \cdots + X_n)}] = E[e^{tX_1}e^{tX_2}\cdots e^{tX_n}] = E[e^{tX_1}]E[e^{tX_2}]\cdots E[e^{tX_n}] = (M_{X_1}(t))^n \] where we used the independence of the \(X_i\)’s. Since each \(X_i\) is an exponential random variable with rate parameter \(\lambda\), we have: \[ M_{X_i}(t) = \frac{\lambda}{\lambda - t} \] for all \(i\) and for \(t < \lambda\). Therefore, we have: \[ M_{S_n}(t) = \left(\frac{\lambda}{\lambda - t}\right)^n \] which is the MGF of a Gamma random variable with shape parameter \(n\) and rate parameter \(\lambda\). Therefore, we conclude that \(S_n\) follows a Gamma distribution with shape parameter \(n\) and rate parameter \(\lambda\).

We have not defined the Gamma distribution yet, so we do so here for completeness. The Gamma distribution with shape parameter \(k\) and rate parameter \(\theta\) has the following PDF: \[ f_X(x) = \begin{cases} \frac{\theta^k}{\Gamma(k)} x^{k-1} e^{-\theta x} & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \] where \(\Gamma(k)\) is the Gamma function defined as: \[ \Gamma(k) = \int_0^\infty x^{k-1} e^{-x} \, dx \] The mean and variance of a Gamma random variable are given by:

  • \(E[X] = \frac{k}{\theta}\)
  • \(Var(X) = \frac{k}{\theta^2}\)

In Example 1.34, \(k = n\) and \(\theta = \lambda\).
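
The Gamma result for sums of exponentials can be checked by simulation. The sketch below compares the sample mean and variance of a sum of \(n\) independent exponential draws with those of direct Gamma draws and with the theoretical values \(n/\lambda\) and \(n/\lambda^2\). The values of `n` and `lam`, the seed, and the sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 5, 2.0              # illustrative number of terms and rate
n_draws = 200_000

# Sum n independent Exponential(rate = lam) draws; NumPy uses scale = 1 / rate
s = rng.exponential(scale=1 / lam, size=(n_draws, n)).sum(axis=1)

# Direct Gamma draws with shape n and rate lam (again scale = 1 / rate)
g = rng.gamma(shape=n, scale=1 / lam, size=n_draws)

print("Theory   mean, var:", n / lam, n / lam**2)
print("Sum      mean, var:", s.mean(), s.var())
print("Gamma    mean, var:", g.mean(), g.var())
```

NumPy parameterizes both distributions by scale rather than rate, which is why `1 / lam` appears in both calls.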

NoteCovariance and Correlation

Definition 1.30 The covariance between two random variables \(X\) and \(Y\), denoted by \(Cov(X, Y)\) is defined as: \[ Cov(X, Y) = E[(X - E[X])(Y - E[Y])] \] It is easy to show that: \[ Cov(X, Y) = E[XY] - E[X]E[Y] \] The covariance measures the linear relationship between two random variables. A positive covariance indicates that the variables tend to increase or decrease together, while a negative covariance indicates that one variable tends to increase when the other decreases. The correlation coefficient between two random variables \(X\) and \(Y\), denoted by \(\rho_{X,Y}\) or \(Corr(X, Y)\), is defined as: \[ Corr(X, Y)=\rho_{X,Y} := \frac{Cov(X, Y)}{\sqrt{Var(X)Var(Y)}} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} \] where \(\sigma_X\) and \(\sigma_Y\) are the standard deviations of \(X\) and \(Y\), respectively.

The correlation coefficient measures the strength and direction of the linear relationship between two random variables. It ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.
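
The identity \(Cov(X, Y) = E[XY] - E[X]E[Y]\) follows by expanding the product in the definition and using linearity of expectation, treating \(E[X]\) and \(E[Y]\) as constants:

\[ \begin{aligned} Cov(X, Y) &= E[(X - E[X])(Y - E[Y])] \\ &= E[XY - X E[Y] - Y E[X] + E[X]E[Y]] \\ &= E[XY] - E[X]E[Y] - E[X]E[Y] + E[X]E[Y] \\ &= E[XY] - E[X]E[Y] \end{aligned} \]
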

TipExample of Covariance and Correlation

Example 1.35 Using the joint distribution of \(X_1\) and \(Z\) in Example 1.23, we can compute the covariance and correlation between \(X_1\) and \(Z\). We have:

  • \(E[X_1]=E[X_1^2] = P(X_1=1)= 0.5\)
  • \(E[Z]=E[Z^2]=P(Z=1) = 0.4\)
  • \(E[X_1Z] =P(X_1=1,Z=1)= 0.3\)

Then we have:

  • \(Cov(X_1, Z) = E[X_1Z] - E[X_1]E[Z] = 0.3 - 0.5 \times 0.4 = 0.1\)

  • \(Var(X_1)=E[X_1^2] - (E[X_1])^2 = 0.5(1- 0.5) = 0.25\)

  • \(Var(Z) = E[Z^2] - (E[Z])^2 = 0.4(1-0.4) = 0.24\)

  • \(Corr(X_1,Z)=\rho_{X_1,Z} = \frac{Cov(X_1, Z)}{\sqrt{Var(X_1)Var(Z)}} = \frac{0.1}{\sqrt{0.25 \times 0.24}} \approx 0.408\)

We can double-check the correlation result above by simulating a large number of coin flips and computing the sample correlation between \(X_1\) and \(Z\).

Code
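import numpy as np
n = 10000  # number of simulated draws
# Bernoulli draws with success probabilities 0.5 and 0.7
x1 = np.random.binomial(1, 0.5, n)
x2 = np.random.binomial(1, 0.7, n)
# z equals 1 when the two draws agree and 0 otherwise
z = (x1 == x2).astype(int)
print("Sample correlation between X1 and Z:", np.corrcoef(x1, z)[0, 1])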
Sample correlation between X1 and Z: 0.3960602084773958
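
A complementary check works directly with a joint pmf table. The table in the sketch below is reconstructed from the three probabilities used in the calculation (\(P(X_1=1)=0.5\), \(P(Z=1)=0.4\), \(P(X_1=1, Z=1)=0.3\)) rather than copied from Example 1.23, so it should be read as an assumption. The code computes the exact covariance and correlation from the table and then samples from it.

```python
import numpy as np

# Joint pmf implied by E[X1] = 0.5, E[Z] = 0.4 and E[X1 Z] = 0.3.
# Rows index x1 in {0, 1}; columns index z in {0, 1}.
joint = np.array([[0.4, 0.1],
                  [0.2, 0.3]])
values = np.array([0, 1])

# Exact moments from the table
e_x1 = (values[:, None] * joint).sum()
e_z = (values[None, :] * joint).sum()
e_x1z = (values[:, None] * values[None, :] * joint).sum()
cov = e_x1z - e_x1 * e_z
# For 0/1 variables, Var = mean * (1 - mean)
corr = cov / np.sqrt(e_x1 * (1 - e_x1) * e_z * (1 - e_z))

# Monte Carlo check: sample cell indices 0..3 with the table's probabilities
rng = np.random.default_rng(3)
cells = rng.choice(4, size=200_000, p=joint.ravel())
x1, z = np.divmod(cells, 2)        # cell = 2 * x1 + z

print("Exact covariance, correlation:", cov, corr)
print("Sample correlation:", np.corrcoef(x1, z)[0, 1])
```

The exact values should match the hand calculation (covariance 0.1, correlation approximately 0.408), and the sample correlation should be close to them.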
TipExample of Covariance and Correlation

Example 1.36 Now using the joint distribution of \(X\) and \(Y\) in Example 1.24

\[ f_{X,Y}(x,y) = \begin{cases} 2 & \text{if } 0 < x < 1 \text{ and } 0 < y < x \\ 0 & \text{otherwise} \end{cases} \]

we can compute the covariance and correlation between \(X\) and \(Y\). We have:

  • \(E[X]=\int_0^1 x \cdot 2x \, dx = \frac{2}{3}\)
  • \(E[Y]=\int_0^1 y \cdot 2(1-y) \, dy = 1-2/3=1/3\)
  • \(E[X^2]=\int_0^1 x^2 \cdot 2x \, dx = \frac{1}{2}\)
  • \(E[Y^2]=\int_0^1 y^2 \cdot 2(1-y) \, dy = \frac{2}{3}-\frac{1}{2}=\frac{1}{6}\)
  • \(Var(X)=E[X^2] - (E[X])^2 = \frac{1}{2} - \left(\frac{2}{3}\right)^2 = \frac{1}{18}\)
  • \(Var(Y)=E[Y^2] - (E[Y])^2 = \frac{1}{6} - \left(\frac{1}{3}\right)^2 = \frac{1}{18}\)
  • \(E[XY]=\int_0^1 \int_0^x xy \cdot 2 \, dy \, dx = \int_0^1 x^3 \, dx = \frac{1}{4}\)

Then we have:

  • \(Cov(X, Y) = E[XY] - E[X]E[Y] = \frac{1}{4} - \frac{2}{3} \times \frac{1}{3} = \frac{1}{36}\)
  • \(Corr(X,Y)=\rho_{X,Y} = \frac{Cov(X, Y)}{\sqrt{Var(X)Var(Y)}} = \frac{\frac{1}{36}}{\sqrt{\frac{1}{18} \times \frac{1}{18}}} = \frac{1}{2}\)

Now assume we have two independent random variables \(X\) and \(Y\) uniformly distributed on the interval \([0,1]\).

Then for any \(x\) and \(y\) in the interval \([0,1]\) such that \(x>y\), we have:

\[ \begin{aligned} F_{X,Y}(x,y|X>Y) &= P(X \leq x, Y \leq y|X>Y)\\ \rule{0in}{3ex}&=\frac{P(X\leq x,Y\leq y,X>Y)}{P(X>Y)}\\ \rule{0in}{3ex}&=\frac{\displaystyle\int_0^y \int_0^u f_X(u)f_Y(v)dvdu+ \int_y^x \int_0^y f_X(u)f_Y(v)dvdu} {\displaystyle \int_0^1 \int_0^u f_X(u)f_Y(v)dvdu}\\ \rule{0in}{3ex}&=\frac{\displaystyle\int_0^y \int_0^u dvdu+ \int_y^x \int_0^y dvdu} {\displaystyle \int_0^1 \int_0^u dvdu}\\ \rule{0in}{5ex}&=\frac{\frac{y^2}{2}+y(x-y)}{1/2}\\ \rule{0in}{4ex}&=y^2+2y(x-y)\,,\quad 0<y<x<1\\ \end{aligned} \]

which is the same CDF as in Example 1.24.

This gives us a way to simulate the joint distribution of \(X\) and \(Y\) given \(X>Y\). We can first simulate two independent uniform random variables \(X\) and \(Y\) on the interval \([0,1]\), and then keep only the pairs \((X,Y)\) such that \(X>Y\). The Python code below does this and, in passing, double-checks the correlation result above.

Code
import numpy as np
import matplotlib.pyplot as plt

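n = 10000
# draw independent Uniform(0, 1) pairs
x = np.random.uniform(0, 1, n)
y = np.random.uniform(0, 1, n)
# keep only the pairs with X > Y (conditioning on the event X > Y)
keep = x > y
y = y[keep]
x = x[keep]

# sample means and variances of the retained pairs
mean_x = np.mean(x)
mean_y = np.mean(y)
var_x = np.var(x)
var_y = np.var(y)
# display the means, variances and correlation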
print("Mean of X:", mean_x)
print("Mean of Y:", mean_y)
print("Variance of X:", var_x)
print("Variance of Y:", var_y)
print("Correlation:", np.corrcoef(x, y)[0, 1])


plt.scatter(x, y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter plot of X and Y')
plt.show()
Mean of X: 0.6639222935392972
Mean of Y: 0.32946047862478955
Variance of X: 0.0551733449988657
Variance of Y: 0.054582907852476645
Correlation: 0.4793903631453601

The scatter plot shows a positive correlation between \(X\) and \(Y\), which is consistent with our calculation of the correlation coefficient: the sample value of about 0.48 is close to the exact value of \(\frac{1}{2}\).
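
The same accept-reject idea can be used to check the conditional CDF derived above, \(F_{X,Y}(x,y|X>Y) = y^2 + 2y(x-y)\) for \(0<y<x<1\). The sketch below compares it with the empirical proportion of retained pairs that satisfy \(X \leq x_0\) and \(Y \leq y_0\). The evaluation point, the seed, and the sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000
u = rng.uniform(0, 1, n)
v = rng.uniform(0, 1, n)
keep = u > v                      # condition on the event X > Y
x, y = u[keep], v[keep]

def cdf_formula(x0, y0):
    """Derived conditional CDF, valid for 0 < y0 < x0 < 1."""
    return y0**2 + 2 * y0 * (x0 - y0)

x0, y0 = 0.8, 0.5                 # illustrative evaluation point with y0 < x0
empirical = np.mean((x <= x0) & (y <= y0))

print("Empirical CDF:", empirical)
print("Formula:      ", cdf_formula(x0, y0))   # 0.25 + 2 * 0.5 * 0.3 = 0.55
```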

NoteCovariance and correlation under independence

Proposition 1.6 If \(X\) and \(Y\) are independent random variables, then \(E[XY] = E[X]E[Y]\) and therefore

  • \(Cov(X, Y) = 0\)
  • \(\rho_{X,Y} = 0\)
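
A short simulation illustrates the proposition. The sketch below draws two independent samples and reports the sample covariance and correlation, which should be close to zero but not exactly zero because of sampling error. The distributions, the seed, and the sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.normal(loc=1.0, scale=2.0, size=n)   # illustrative choice of distributions
y = rng.exponential(scale=0.5, size=n)       # drawn independently of x

sample_cov = np.cov(x, y)[0, 1]
sample_corr = np.corrcoef(x, y)[0, 1]

print("E[XY] vs E[X]E[Y]:", np.mean(x * y), np.mean(x) * np.mean(y))
print("Sample covariance: ", sample_cov)
print("Sample correlation:", sample_corr)
```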